I have a proposition for you; it will be of mutual benefit to both of us.
-- Dear Sir, I am David Williams. I have a proposition involving an investment initiative in your country to discuss with you. It will be of mutual benefit to both of us, and I believe we can handle it together once we have a common understanding and mutual co-operation in the execution of the modalities. I am a staff member at a commercial bank, and I discovered a dormant account belonging to the late Mr. Richard Cousin. If you are interested, please e-mail me back at officei...@citromail.hu for more details. If you received this message in your spam or junk folder, that is due to your network provider. I await your earliest response. Yours sincerely, David Williams E-mail: officei...@citromail.hu
Mutual Benefit
Hi, I am contacting you with regard to using your last name to claim long-dormant funds belonging to a late depositor. Let me know if you are interested in more details. Best regards, David Lin
My name is Francis Aaron. I have a financial proposal which will benefit both of us. If you are interested, get back to me for more details. Thanks.
Mutual Benefit
I am Sir Donald Hopkins, working at Fortune Wine Company UK. I have a profitable business proposal for you, as one of the products my company uses in manufacturing wine is found in India. I will give you more details on the business transaction once I get a response from you. Best regards, Donald Hopkins, Department Manager, Fortune Wine Company UK
Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp
On Tue, Oct 23, 2018 at 03:13:43PM +, Raslan, KarimAllah wrote: > On Fri, 2018-10-19 at 13:21 -0700, Paul E. McKenney wrote: > > On Fri, Oct 19, 2018 at 07:45:51PM +, Raslan, KarimAllah wrote: > > > > > > On Fri, 2018-10-19 at 05:31 -0700, Paul E. McKenney wrote: > > > > > > > > On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote: > > > > > > > > > > > > > > > When expedited grace-period is set, both synchronize_sched > > > > > synchronize_rcu_bh can be optimized to have a significantly lower > > > > > latency. > > > > > > > > > > Improve wait_rcu_gp handling to also account for expedited > > > > > grace-period. > > > > > The downside is that wait_rcu_gp will not wait anymore for all RCU > > > > > variants > > > > > concurrently when an expedited grace-period is set, however, given the > > > > > improved latency it does not really matter. > > > > > > > > > > Cc: Paul E. McKenney > > > > > Cc: Josh Triplett > > > > > Cc: Steven Rostedt > > > > > Cc: Mathieu Desnoyers > > > > > Cc: Lai Jiangshan > > > > > Cc: linux-kernel@vger.kernel.org > > > > > Signed-off-by: KarimAllah Ahmed > > > > > > > > Cute! > > > > > > > > Unfortunately, there are a few problems with this patch: > > > > > > > > 1. I will be eliminating synchronize_rcu_mult() due to the fact > > > > that > > > > the upcoming RCU flavor consolidation eliminates its sole > > > > caller. > > > > See 5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()") > > > > in my -rcu tree. This would of course also eliminate the > > > > effects > > > > of this patch. > > > > > > Your patch covers our use-case already, but I still think that the > > > semantics > > > for wait_rcu_gp is not clear to me. > > > > > > The problem for us was that sched_cpu_deactivate would call > > > synchronize_rcu_mult which does not check for "expedited" at all. So even > > > though we are already using rcu_expedited sysctl variable, > > > synchronize_rcu_mult > > > was just ignoring it. 
> > > > > > That being said, I indeed overlooked rcu_normal and that it takes > > > precedence > > > over expedited and I did not notice at all the deadlock you mentioned > > > below! > > > > > > That can however be easily fixed by also checking for !rcu_gp_is_normal. > > > > ??? > > > > The aforementioned 5fc9d4e000b1 commit replaces the synchronize_rcu_mult() > > with synchronize_rcu(), which really would be subject to the sysfs > > variables. Of course, this is not yet in mainline, so it perhaps cannot > > solve your immediate problem, which probably involve older kernels in > > any case. More on this below... > > > > > > > > > > > > > 2. The real-time guys' users are not going to be at all happy > > > > with the IPIs resulting from the _expedited() API members. > > > > Yes, they can boot with rcupdate.rcu_normal=1, but they don't > > > > always need that big a hammer, and use of this kernel parameter > > > > can slow down boot, hibernation, suspend, network configuration, > > > > and much else besides. We therefore don't want them to have to > > > > use rcupdate.rcu_normal=1 unless absolutely necessary. > > > > > > I might be missing something here. Why would they need to "explicitly" > > > use > > > rcu_normal? If rcu_expedited is set, would not the expected behavior is > > > to call > > > into the expedited version? > > > > > > My patch should only activate *expedited* if and only if it is set. > > > > You are right, I was confused. However... > > > > > > > > I think I might be misunderstanding the expected behavior > > > from synchronize_rcu_mult. My understanding is that something like: > > > > > > synchronize_rcu_mult(call_rcu_sched) and synchronize_rcu() should have an > > > identical behavior, right? > > > > You would clearly prefer that it did, and the commit log does seem to > > read that way, but synchronize_rcu_mult() is going away anyway, so there > > isn't a whole lot of point in arguing about what it should have done. 
> > And the eventual implementation (with 5fc9d4e000b1 or its successor) > > will act as you want. > > > > > > > > At least in this commit: > > > > > > commit d7d34d5e46140 ("sched: Rely on synchronize_rcu_mult() > > > de-duplication") > > > > > > .. the change clearly gives the impression that they can be used > > > interchangeably. The problem is that this is not true when you look at > > > the > > > implementation. One of them (i.e. synchronize_rcu) will respect the > > > expedite_rcu flag set by sysfs while the other (i.e. > > > synchronize_rcu_mult) > > > simply ignores it. > > > > > > So my patch is about making sure that both of the variants actually > > > respect > > > it. > > > > I am guessing that you need to make an older version of the kernel > > expedite the CPU-hotplug grace periods. I am also guessing that you can > > carry patches to your kernels. If so, I suggest the
Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp
On Fri, 2018-10-19 at 13:21 -0700, Paul E. McKenney wrote: > On Fri, Oct 19, 2018 at 07:45:51PM +, Raslan, KarimAllah wrote: > > > > On Fri, 2018-10-19 at 05:31 -0700, Paul E. McKenney wrote: > > > > > > On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote: > > > > > > > > > > > > When expedited grace-period is set, both synchronize_sched > > > > synchronize_rcu_bh can be optimized to have a significantly lower > > > > latency. > > > > > > > > Improve wait_rcu_gp handling to also account for expedited grace-period. > > > > The downside is that wait_rcu_gp will not wait anymore for all RCU > > > > variants > > > > concurrently when an expedited grace-period is set, however, given the > > > > improved latency it does not really matter. > > > > > > > > Cc: Paul E. McKenney > > > > Cc: Josh Triplett > > > > Cc: Steven Rostedt > > > > Cc: Mathieu Desnoyers > > > > Cc: Lai Jiangshan > > > > Cc: linux-kernel@vger.kernel.org > > > > Signed-off-by: KarimAllah Ahmed > > > > > > Cute! > > > > > > Unfortunately, there are a few problems with this patch: > > > > > > 1.I will be eliminating synchronize_rcu_mult() due to the fact > > > that > > > the upcoming RCU flavor consolidation eliminates its sole caller. > > > See 5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()") > > > in my -rcu tree. This would of course also eliminate the effects > > > of this patch. > > > > Your patch covers our use-case already, but I still think that the > > semantics > > for wait_rcu_gp is not clear to me. > > > > The problem for us was that sched_cpu_deactivate would call > > synchronize_rcu_mult which does not check for "expedited" at all. So even > > though we are already using rcu_expedited sysctl variable, > > synchronize_rcu_mult > > was just ignoring it. > > > > That being said, I indeed overlooked rcu_normal and that it takes > > precedence > > over expedited and I did not notice at all the deadlock you mentioned below! 
> > > > That can however be easily fixed by also checking for !rcu_gp_is_normal. > > ??? > > The aforementioned 5fc9d4e000b1 commit replaces the synchronize_rcu_mult() > with synchronize_rcu(), which really would be subject to the sysfs > variables. Of course, this is not yet in mainline, so it perhaps cannot > solve your immediate problem, which probably involve older kernels in > any case. More on this below... > > > > > > > > > 2.The real-time guys' users are not going to be at all happy > > > with the IPIs resulting from the _expedited() API members. > > > Yes, they can boot with rcupdate.rcu_normal=1, but they don't > > > always need that big a hammer, and use of this kernel parameter > > > can slow down boot, hibernation, suspend, network configuration, > > > and much else besides. We therefore don't want them to have to > > > use rcupdate.rcu_normal=1 unless absolutely necessary. > > > > I might be missing something here. Why would they need to "explicitly" use > > rcu_normal? If rcu_expedited is set, would not the expected behavior is to > > call > > into the expedited version? > > > > My patch should only activate *expedited* if and only if it is set. > > You are right, I was confused. However... > > > > > I think I might be misunderstanding the expected behavior > > from synchronize_rcu_mult. My understanding is that something like: > > > > synchronize_rcu_mult(call_rcu_sched) and synchronize_rcu() should have an > > identical behavior, right? > > You would clearly prefer that it did, and the commit log does seem to > read that way, but synchronize_rcu_mult() is going away anyway, so there > isn't a whole lot of point in arguing about what it should have done. > And the eventual implementation (with 5fc9d4e000b1 or its successor) > will act as you want. > > > > > At least in this commit: > > > > commit d7d34d5e46140 ("sched: Rely on synchronize_rcu_mult() > > de-duplication") > > > > .. 
> > the change clearly gives the impression that they can be used
> > interchangeably. The problem is that this is not true when you look at the
> > implementation. One of them (i.e. synchronize_rcu) will respect the
> > expedite_rcu flag set by sysfs while the other (i.e. synchronize_rcu_mult)
> > simply ignores it.
> >
> > So my patch is about making sure that both of the variants actually respect
> > it.
>
> I am guessing that you need to make an older version of the kernel
> expedite the CPU-hotplug grace periods. I am also guessing that you can
> carry patches to your kernels. If so, I suggest the following simpler
> change to sched_cpu_deactivate() in kernel/sched/core.c:
>
> 	if (rcu_gp_is_expedited()) {
> 		synchronize_sched_expedited();
> 		if (IS_ENABLED(CONFIG_PREEMPT))
> 			synchronize_rcu_expedited();
> 	} else {
> 		synchronize_rcu_mult(call_rcu, call_rcu_sched);
> 	}
>
> As soon as this patch conflicts due to the
Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp
On Fri, Oct 19, 2018 at 07:45:51PM +, Raslan, KarimAllah wrote: > On Fri, 2018-10-19 at 05:31 -0700, Paul E. McKenney wrote: > > On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote: > > > > > > When expedited grace-period is set, both synchronize_sched > > > synchronize_rcu_bh can be optimized to have a significantly lower latency. > > > > > > Improve wait_rcu_gp handling to also account for expedited grace-period. > > > The downside is that wait_rcu_gp will not wait anymore for all RCU > > > variants > > > concurrently when an expedited grace-period is set, however, given the > > > improved latency it does not really matter. > > > > > > Cc: Paul E. McKenney > > > Cc: Josh Triplett > > > Cc: Steven Rostedt > > > Cc: Mathieu Desnoyers > > > Cc: Lai Jiangshan > > > Cc: linux-kernel@vger.kernel.org > > > Signed-off-by: KarimAllah Ahmed > > > > Cute! > > > > Unfortunately, there are a few problems with this patch: > > > > 1. I will be eliminating synchronize_rcu_mult() due to the fact that > > the upcoming RCU flavor consolidation eliminates its sole caller. > > See 5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()") > > in my -rcu tree. This would of course also eliminate the effects > > of this patch. > > Your patch covers our use-case already, but I still think that the semantics > for wait_rcu_gp is not clear to me. > > The problem for us was that sched_cpu_deactivate would call > synchronize_rcu_mult which does not check for "expedited" at all. So even > though we are already using rcu_expedited sysctl variable, > synchronize_rcu_mult > was just ignoring it. > > That being said, I indeed overlooked rcu_normal and that it takes precedence > over expedited and I did not notice at all the deadlock you mentioned below! > > That can however be easily fixed by also checking for !rcu_gp_is_normal. ??? The aforementioned 5fc9d4e000b1 commit replaces the synchronize_rcu_mult() with synchronize_rcu(), which really would be subject to the sysfs variables. 
Of course, this is not yet in mainline, so it perhaps cannot solve your immediate problem, which probably involve older kernels in any case. More on this below... > > 2. The real-time guys' users are not going to be at all happy > > with the IPIs resulting from the _expedited() API members. > > Yes, they can boot with rcupdate.rcu_normal=1, but they don't > > always need that big a hammer, and use of this kernel parameter > > can slow down boot, hibernation, suspend, network configuration, > > and much else besides. We therefore don't want them to have to > > use rcupdate.rcu_normal=1 unless absolutely necessary. > > I might be missing something here. Why would they need to "explicitly" use > rcu_normal? If rcu_expedited is set, would not the expected behavior is to > call > into the expedited version? > > My patch should only activate *expedited* if and only if it is set. You are right, I was confused. However... > I think I might be misunderstanding the expected behavior > from synchronize_rcu_mult. My understanding is that something like: > > synchronize_rcu_mult(call_rcu_sched) and synchronize_rcu() should have an > identical behavior, right? You would clearly prefer that it did, and the commit log does seem to read that way, but synchronize_rcu_mult() is going away anyway, so there isn't a whole lot of point in arguing about what it should have done. And the eventual implementation (with 5fc9d4e000b1 or its successor) will act as you want. > At least in this commit: > > commit d7d34d5e46140 ("sched: Rely on synchronize_rcu_mult() de-duplication") > > .. the change clearly gives the impression that they can be used > interchangeably. The problem is that this is not true when you look at the > implementation. One of them (i.e. synchronize_rcu) will respect the > expedite_rcu flag set by sysfs while the other (i.e. synchronize_rcu_mult) > simply ignores it. > > So my patch is about making sure that both of the variants actually respect > it. 
I am guessing that you need to make an older version of the kernel
expedite the CPU-hotplug grace periods. I am also guessing that you can
carry patches to your kernels. If so, I suggest the following simpler
change to sched_cpu_deactivate() in kernel/sched/core.c:

	if (rcu_gp_is_expedited()) {
		synchronize_sched_expedited();
		if (IS_ENABLED(CONFIG_PREEMPT))
			synchronize_rcu_expedited();
	} else {
		synchronize_rcu_mult(call_rcu, call_rcu_sched);
	}

As soon as this patch conflicts due to the synchronize_rcu_mult() becoming
synchronize_rcu(), you can drop the patch. And this is the only use of
synchronize_rcu_mult(), so this approach loses no generality.

Longer term, this patch might possibly be the backport of 5fc9d4e000b1 back
to v4.14, but at the end of the day this is up to the various -stable
maintainers.

Hmmm... If
Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp
On Fri, 2018-10-19 at 05:31 -0700, Paul E. McKenney wrote:
> On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote:
> >
> > When expedited grace-period is set, both synchronize_sched
> > synchronize_rcu_bh can be optimized to have a significantly lower latency.
> >
> > Improve wait_rcu_gp handling to also account for expedited grace-period.
> > The downside is that wait_rcu_gp will not wait anymore for all RCU variants
> > concurrently when an expedited grace-period is set, however, given the
> > improved latency it does not really matter.
> >
> > Cc: Paul E. McKenney
> > Cc: Josh Triplett
> > Cc: Steven Rostedt
> > Cc: Mathieu Desnoyers
> > Cc: Lai Jiangshan
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed
>
> Cute!
>
> Unfortunately, there are a few problems with this patch:
>
> 1.	I will be eliminating synchronize_rcu_mult() due to the fact that
> 	the upcoming RCU flavor consolidation eliminates its sole caller.
> 	See 5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()")
> 	in my -rcu tree. This would of course also eliminate the effects
> 	of this patch.

Your patch covers our use-case already, but I still think that the semantics
of wait_rcu_gp are not clear to me.

The problem for us was that sched_cpu_deactivate would call
synchronize_rcu_mult, which does not check for "expedited" at all. So even
though we were already using the rcu_expedited sysctl variable,
synchronize_rcu_mult was just ignoring it.

That being said, I indeed overlooked rcu_normal and that it takes precedence
over expedited, and I did not notice at all the deadlock you mentioned below!

That can however be easily fixed by also checking for !rcu_gp_is_normal.

> 2.	The real-time guys' users are not going to be at all happy
> 	with the IPIs resulting from the _expedited() API members.
> 	Yes, they can boot with rcupdate.rcu_normal=1, but they don't
> 	always need that big a hammer, and use of this kernel parameter
> 	can slow down boot, hibernation, suspend, network configuration,
> 	and much else besides. We therefore don't want them to have to
> 	use rcupdate.rcu_normal=1 unless absolutely necessary.

I might be missing something here. Why would they need to "explicitly" use
rcu_normal? If rcu_expedited is set, would not the expected behavior be to
call into the expedited version?

My patch should only activate *expedited* if and only if it is set.

I think I might be misunderstanding the expected behavior of
synchronize_rcu_mult. My understanding is that something like
synchronize_rcu_mult(call_rcu_sched) and synchronize_rcu() should have
identical behavior, right?

At least in this commit:

commit d7d34d5e46140 ("sched: Rely on synchronize_rcu_mult() de-duplication")

.. the change clearly gives the impression that they can be used
interchangeably. The problem is that this is not true when you look at the
implementation. One of them (i.e. synchronize_rcu) will respect the
expedite_rcu flag set by sysfs while the other (i.e. synchronize_rcu_mult)
simply ignores it.

So my patch is about making sure that both of the variants actually respect
it.

> 3.	If the real-time guys' users were to have booted with
> 	rcupdate.rcu_normal=1, then synchronize_sched_expedited()
> 	would invoke _synchronize_rcu_expedited, which would invoke
> 	wait_rcu_gp(), which would invoke _wait_rcu_gp() which would
> 	invoke __wait_rcu_gp(), which, given your patch, would in turn
> 	invoke synchronize_sched_expedited(). This situation could
> 	well prevent their systems from meeting their response-time
> 	requirements.
>
> So I cannot accept this patch nor for that matter any similar patch.
>
> But what were you really trying to get done here? If you were thinking
> of adding another synchronize_rcu_mult(), the flavor consolidation will
> make that unnecessary in most cases.
If you are trying to speed up > CPU-hotplug operations, I suggest using the rcu_expedited sysctl variable > when taking a CPU offline. If something else, please let me know what > it is so that we can work out how the problem might best be solved. > > Thanx, Paul > > > > > --- > > kernel/rcu/update.c | 34 -- > > 1 file changed, 28 insertions(+), 6 deletions(-) > > > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c > > index 68fa19a..44b8817 100644 > > --- a/kernel/rcu/update.c > > +++ b/kernel/rcu/update.c > > @@ -392,13 +392,27 @@ void __wait_rcu_gp(bool checktiny, int n, > > call_rcu_func_t *crcu_array, > > might_sleep(); > > continue; > > } > > - init_rcu_head_on_stack(_array[i].head); > > - init_completion(_array[i].completion); > > + > > for (j = 0; j < i; j++) > > if (crcu_array[j] == crcu_array[i]) > >
Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp
On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote:
> When expedited grace-period is set, both synchronize_sched and
> synchronize_rcu_bh can be optimized to have a significantly lower latency.
>
> Improve wait_rcu_gp handling to also account for expedited grace-period.
> The downside is that wait_rcu_gp will not wait anymore for all RCU variants
> concurrently when an expedited grace-period is set; however, given the
> improved latency it does not really matter.
>
> Cc: Paul E. McKenney
> Cc: Josh Triplett
> Cc: Steven Rostedt
> Cc: Mathieu Desnoyers
> Cc: Lai Jiangshan
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: KarimAllah Ahmed

Cute!

Unfortunately, there are a few problems with this patch:

1.	I will be eliminating synchronize_rcu_mult() because the upcoming
	RCU flavor consolidation eliminates its sole caller.  See
	5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()") in my
	-rcu tree.  This would of course also eliminate the effects of
	this patch.

2.	The real-time guys' users are not going to be at all happy with
	the IPIs resulting from the _expedited() API members.  Yes, they
	can boot with rcupdate.rcu_normal=1, but they don't always need
	that big a hammer, and use of this kernel parameter can slow down
	boot, hibernation, suspend, network configuration, and much else
	besides.  We therefore don't want them to have to use
	rcupdate.rcu_normal=1 unless absolutely necessary.

3.	If the real-time guys' users were to have booted with
	rcupdate.rcu_normal=1, then synchronize_sched_expedited() would
	invoke _synchronize_rcu_expedited(), which would invoke
	wait_rcu_gp(), which would invoke _wait_rcu_gp(), which would
	invoke __wait_rcu_gp(), which, given your patch, would in turn
	invoke synchronize_sched_expedited().  This situation could well
	prevent their systems from meeting their response-time
	requirements.

So I cannot accept this patch, nor for that matter any similar patch.

But what were you really trying to get done here?  If you were thinking
of adding another synchronize_rcu_mult(), the flavor consolidation will
make that unnecessary in most cases.  If you are trying to speed up
CPU-hotplug operations, I suggest using the rcu_expedited sysctl variable
when taking a CPU offline.  If something else, please let me know what
it is so that we can work out how the problem might best be solved.

							Thanx, Paul

> ---
>  kernel/rcu/update.c | 34 ++++++++++++++++++++++++++++------
>  1 file changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 68fa19a..44b8817 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -392,13 +392,27 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
>  			might_sleep();
>  			continue;
>  		}
> -		init_rcu_head_on_stack(&rs_array[i].head);
> -		init_completion(&rs_array[i].completion);
> +
>  		for (j = 0; j < i; j++)
>  			if (crcu_array[j] == crcu_array[i])
>  				break;
> -		if (j == i)
> -			(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
> +		if (j != i)
> +			continue;
> +
> +		if ((crcu_array[i] == call_rcu_sched ||
> +		     crcu_array[i] == call_rcu_bh)
> +		    && rcu_gp_is_expedited()) {
> +			if (crcu_array[i] == call_rcu_sched)
> +				synchronize_sched_expedited();
> +			else
> +				synchronize_rcu_bh_expedited();
> +
> +			continue;
> +		}
> +
> +		init_rcu_head_on_stack(&rs_array[i].head);
> +		init_completion(&rs_array[i].completion);
> +		(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
>  	}
>
>  	/* Wait for all callbacks to be invoked. */
> @@ -407,11 +421,19 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
>  		    (crcu_array[i] == call_rcu ||
>  		     crcu_array[i] == call_rcu_bh))
>  			continue;
> +
> +		if ((crcu_array[i] == call_rcu_sched ||
> +		     crcu_array[i] == call_rcu_bh)
> +		    && rcu_gp_is_expedited())
> +			continue;
> +
>  		for (j = 0; j < i; j++)
>  			if (crcu_array[j] == crcu_array[i])
>  				break;
> -		if (j == i)
> -			wait_for_completion(&rs_array[i].completion);
> +		if (j != i)
> +			continue;
> +
> +		wait_for_completion(&rs_array[i].completion);
>  		destroy_rcu_head_on_stack(&rs_array[i].head);
>
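Paul's suggested alternative — flipping the rcu_expedited sysctl only around the hotplug operation — looks roughly like this from userspace.  This is an illustrative admin fragment, not something runnable outside a privileged host; the CPU number is made up, and the paths are the standard sysfs locations.

```shell
# Expedite RCU grace periods only for the duration of the hotplug operation.
echo 1 > /sys/kernel/rcu_expedited

# Take the (illustrative) CPU 3 offline and bring it back.
echo 0 > /sys/devices/system/cpu/cpu3/online
echo 1 > /sys/devices/system/cpu/cpu3/online

# Restore normal (non-IPI) grace periods for everyone else.
echo 0 > /sys/kernel/rcu_expedited
```

Restoring the knob afterwards matters: leaving rcu_expedited set system-wide imposes the IPI cost described in point 2 on all RCU users, not just the hotplug path.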
[PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp
When expedited grace-period is set, both synchronize_sched and
synchronize_rcu_bh can be optimized to have a significantly lower latency.

Improve wait_rcu_gp handling to also account for expedited grace-period.
The downside is that wait_rcu_gp will not wait anymore for all RCU variants
concurrently when an expedited grace-period is set; however, given the
improved latency it does not really matter.

Cc: Paul E. McKenney
Cc: Josh Triplett
Cc: Steven Rostedt
Cc: Mathieu Desnoyers
Cc: Lai Jiangshan
Cc: linux-kernel@vger.kernel.org
Signed-off-by: KarimAllah Ahmed
---
 kernel/rcu/update.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 68fa19a..44b8817 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -392,13 +392,27 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
 			might_sleep();
 			continue;
 		}
-		init_rcu_head_on_stack(&rs_array[i].head);
-		init_completion(&rs_array[i].completion);
+
 		for (j = 0; j < i; j++)
 			if (crcu_array[j] == crcu_array[i])
 				break;
-		if (j == i)
-			(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
+		if (j != i)
+			continue;
+
+		if ((crcu_array[i] == call_rcu_sched ||
+		     crcu_array[i] == call_rcu_bh)
+		    && rcu_gp_is_expedited()) {
+			if (crcu_array[i] == call_rcu_sched)
+				synchronize_sched_expedited();
+			else
+				synchronize_rcu_bh_expedited();
+
+			continue;
+		}
+
+		init_rcu_head_on_stack(&rs_array[i].head);
+		init_completion(&rs_array[i].completion);
+		(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
 	}

 	/* Wait for all callbacks to be invoked. */
@@ -407,11 +421,19 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
 		    (crcu_array[i] == call_rcu ||
 		     crcu_array[i] == call_rcu_bh))
 			continue;
+
+		if ((crcu_array[i] == call_rcu_sched ||
+		     crcu_array[i] == call_rcu_bh)
+		    && rcu_gp_is_expedited())
+			continue;
+
 		for (j = 0; j < i; j++)
 			if (crcu_array[j] == crcu_array[i])
 				break;
-		if (j == i)
-			wait_for_completion(&rs_array[i].completion);
+		if (j != i)
+			continue;
+
+		wait_for_completion(&rs_array[i].completion);
 		destroy_rcu_head_on_stack(&rs_array[i].head);
 	}
 }
-- 
2.7.4
[PATCH v3 tip/core/rcu 24/40] srcu: Move combining-tree definitions for SRCU's benefit
This commit moves the C preprocessor code that defines the default shape
of the rcu_node combining tree to a new include/linux/rcu_node_tree.h
file as a first step towards enabling SRCU to create its own combining
tree, which in turn enables SRCU to implement per-CPU callback handling,
thus avoiding contention on the lock currently guarding the single list
of callbacks.  Note that users of SRCU still need to know the size of
the srcu_struct structure, hence include/linux rather than kernel/rcu.

This commit is code-movement only.

Signed-off-by: Paul E. McKenney
---
 include/linux/rcu_node_tree.h | 102 ++++++++++++++++++++++++++++++++++++++++++
 kernel/rcu/tree.h             |  71 +-----------------------------------------
 2 files changed, 103 insertions(+), 70 deletions(-)
 create mode 100644 include/linux/rcu_node_tree.h

diff --git a/include/linux/rcu_node_tree.h b/include/linux/rcu_node_tree.h
new file mode 100644
index 000000000000..b7eb97096b1c
--- /dev/null
+++ b/include/linux/rcu_node_tree.h
@@ -0,0 +1,102 @@
+/*
+ * RCU node combining tree definitions.  These are used to compute
+ * global attributes while avoiding common-case global contention.  A key
+ * property that these computations rely on is a tournament-style approach
+ * where only one of the tasks contending a lower level in the tree need
+ * advance to the next higher level.  If properly configured, this allows
+ * unlimited scalability while maintaining a constant level of contention
+ * on the root node.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, you can access it online at
+ * http://www.gnu.org/licenses/gpl-2.0.html.
+ *
+ * Copyright IBM Corporation, 2017
+ *
+ * Author: Paul E. McKenney
+ */
+
+#ifndef __LINUX_RCU_NODE_TREE_H
+#define __LINUX_RCU_NODE_TREE_H
+
+/*
+ * Define shape of hierarchy based on NR_CPUS, CONFIG_RCU_FANOUT, and
+ * CONFIG_RCU_FANOUT_LEAF.
+ * In theory, it should be possible to add more levels straightforwardly.
+ * In practice, this did work well going from three levels to four.
+ * Of course, your mileage may vary.
+ */
+
+#ifdef CONFIG_RCU_FANOUT
+#define RCU_FANOUT CONFIG_RCU_FANOUT
+#else /* #ifdef CONFIG_RCU_FANOUT */
+# ifdef CONFIG_64BIT
+# define RCU_FANOUT 64
+# else
+# define RCU_FANOUT 32
+# endif
+#endif /* #else #ifdef CONFIG_RCU_FANOUT */
+
+#ifdef CONFIG_RCU_FANOUT_LEAF
+#define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
+#else /* #ifdef CONFIG_RCU_FANOUT_LEAF */
+#define RCU_FANOUT_LEAF 16
+#endif /* #else #ifdef CONFIG_RCU_FANOUT_LEAF */
+
+#define RCU_FANOUT_1	(RCU_FANOUT_LEAF)
+#define RCU_FANOUT_2	(RCU_FANOUT_1 * RCU_FANOUT)
+#define RCU_FANOUT_3	(RCU_FANOUT_2 * RCU_FANOUT)
+#define RCU_FANOUT_4	(RCU_FANOUT_3 * RCU_FANOUT)
+
+#if NR_CPUS <= RCU_FANOUT_1
+# define RCU_NUM_LVLS	1
+# define NUM_RCU_LVL_0	1
+# define NUM_RCU_NODES	NUM_RCU_LVL_0
+# define NUM_RCU_LVL_INIT	{ NUM_RCU_LVL_0 }
+# define RCU_NODE_NAME_INIT	{ "rcu_node_0" }
+# define RCU_FQS_NAME_INIT	{ "rcu_node_fqs_0" }
+#elif NR_CPUS <= RCU_FANOUT_2
+# define RCU_NUM_LVLS	2
+# define NUM_RCU_LVL_0	1
+# define NUM_RCU_LVL_1	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+# define NUM_RCU_NODES	(NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
+# define NUM_RCU_LVL_INIT	{ NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
+# define RCU_NODE_NAME_INIT	{ "rcu_node_0", "rcu_node_1" }
+# define RCU_FQS_NAME_INIT	{ "rcu_node_fqs_0", "rcu_node_fqs_1" }
+#elif NR_CPUS <= RCU_FANOUT_3
+# define RCU_NUM_LVLS	3
+# define NUM_RCU_LVL_0	1
+# define NUM_RCU_LVL_1	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+# define NUM_RCU_LVL_2	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+# define NUM_RCU_NODES	(NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
+# define NUM_RCU_LVL_INIT	{ NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
+# define RCU_NODE_NAME_INIT	{ "rcu_node_0", "rcu_node_1", "rcu_node_2" }
+# define RCU_FQS_NAME_INIT	{ "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
+#elif NR_CPUS <= RCU_FANOUT_4
+# define RCU_NUM_LVLS	4
+# define NUM_RCU_LVL_0	1
+# define NUM_RCU_LVL_1	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
+# define NUM_RCU_LVL_2	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+# define NUM_RCU_LVL_3	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+# define NUM_RCU_NODES	(NUM_RCU_LVL_0
[PATCH v2 tip/core/rcu 24/39] srcu: Move combining-tree definitions for SRCU's benefit
This commit moves the C preprocessor code that defines the default shape
of the rcu_node combining tree to a new include/linux/rcu_node_tree.h
file as a first step towards enabling SRCU to create its own combining
tree, which in turn enables SRCU to implement per-CPU callback handling,
thus avoiding contention on the lock currently guarding the single list
of callbacks.  Note that users of SRCU still need to know the size of
the srcu_struct structure, hence include/linux rather than kernel/rcu.

This commit is code-movement only.

Signed-off-by: Paul E. McKenney
---
 include/linux/rcu_node_tree.h | 102 ++
 kernel/rcu/tree.h             |  71 +
 2 files changed, 103 insertions(+), 70 deletions(-)
 create mode 100644 include/linux/rcu_node_tree.h

diff --git a/include/linux/rcu_node_tree.h b/include/linux/rcu_node_tree.h
new file mode 100644
index ..b7eb97096b1c
--- /dev/null
+++ b/include/linux/rcu_node_tree.h
@@ -0,0 +1,102 @@
+/*
+ * RCU node combining tree definitions.  These are used to compute
+ * global attributes while avoiding common-case global contention.  A key
+ * property that these computations rely on is a tournament-style approach
+ * where only one of the tasks contending a lower level in the tree need
+ * advance to the next higher level.  If properly configured, this allows
+ * unlimited scalability while maintaining a constant level of contention
+ * on the root node.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, you can access it online at
+ * http://www.gnu.org/licenses/gpl-2.0.html.
+ *
+ * Copyright IBM Corporation, 2017
+ *
+ * Author: Paul E. McKenney
+ */
+
+#ifndef __LINUX_RCU_NODE_TREE_H
+#define __LINUX_RCU_NODE_TREE_H
+
+/*
+ * Define shape of hierarchy based on NR_CPUS, CONFIG_RCU_FANOUT, and
+ * CONFIG_RCU_FANOUT_LEAF.
+ * In theory, it should be possible to add more levels straightforwardly.
+ * In practice, this did work well going from three levels to four.
+ * Of course, your mileage may vary.
+ */
+
+#ifdef CONFIG_RCU_FANOUT
+#define RCU_FANOUT CONFIG_RCU_FANOUT
+#else /* #ifdef CONFIG_RCU_FANOUT */
+# ifdef CONFIG_64BIT
+# define RCU_FANOUT 64
+# else
+# define RCU_FANOUT 32
+# endif
+#endif /* #else #ifdef CONFIG_RCU_FANOUT */
+
+#ifdef CONFIG_RCU_FANOUT_LEAF
+#define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
+#else /* #ifdef CONFIG_RCU_FANOUT_LEAF */
+#define RCU_FANOUT_LEAF 16
+#endif /* #else #ifdef CONFIG_RCU_FANOUT_LEAF */
+
+#define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
+#define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT)
+#define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT)
+#define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT)
+
+#if NR_CPUS <= RCU_FANOUT_1
+# define RCU_NUM_LVLS 1
+# define NUM_RCU_LVL_0 1
+# define NUM_RCU_NODES NUM_RCU_LVL_0
+# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
+# define RCU_NODE_NAME_INIT { "rcu_node_0" }
+# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
+#elif NR_CPUS <= RCU_FANOUT_2
+# define RCU_NUM_LVLS 2
+# define NUM_RCU_LVL_0 1
+# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+# define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
+# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
+# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
+# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
+#elif NR_CPUS <= RCU_FANOUT_3
+# define RCU_NUM_LVLS 3
+# define NUM_RCU_LVL_0 1
+# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+# define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+# define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
+# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
+# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
+# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
+#elif NR_CPUS <= RCU_FANOUT_4
+# define RCU_NUM_LVLS 4
+# define NUM_RCU_LVL_0 1
+# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
+# define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+# define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS,
[PATCH v2 tip/core/rcu 24/39] srcu: Move combining-tree definitions for SRCU's benefit
[PATCH tip/core/rcu 27/40] srcu: Move combining-tree definitions for SRCU's benefit
BENEFIT
You are a recipient to Mrs Julie Leach Donation of $3 million USD. Contact(julieleac...@gmail.com) for claims.
Grant Benefit
You are a recipient to Mrs Julie Leach Donation of $2 million USD. Contact (julieleach...@gmail.com) for claims
Benefit!
Mutual benefit for the both of us from Camp Stanley, get back to me for more info.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Jun 30, 2016 2:25 PM, "Dave Hansen" wrote:
>
> On 06/30/2016 10:36 AM, Andy Lutomirski wrote:
> >>> We make baseline_pkru a process-wide baseline and store it in
> >>> mm->context.  That way, no matter which thread gets interrupted for a
> >>> signal, they see consistent values.  We only write to it when an app
> >>> _specifically_ asks for it to be updated with a special flag to
> >>> sys_pkey_set().
> >>>
> >>> When an app uses the execute-only support, we implicitly set the
> >>> read-disable bit in baseline_pkru for the execute-only pkey.
> ...
> > Looking at your git tree, which I assume is a reasonably approximation
> > of your current patches, this seems to be unimplemented.  I, at least,
> > would be nervous about using PKRU for protection of critical data if
> > signal handlers are unconditionally exempt.
>
> I actually went along and implemented this using an extra 'flag' for
> pkey_get/set().  I just left it out of this stage since I'm having
> enough problems getting it in with the existing set of features. :)
>
> I'm confident we can add this later with the flags we can pass to
> pkey_get() and pkey_set().
>
> > Also, the lazily allocated no-read key for execute-only is done in the
> > name of performance, but it results in odd semantics.  How much of a
> > performance win is preserving the init optimization of PKRU in
> > practice?  (I.e. how much faster are XSAVE and XRSTOR?)  I can't test
> > because even my Skylake laptop doesn't have PKRU.
>
> This is admittedly not the most realistic benchmark because everything
> is cache-warm, but I ran Ingo's FPU "measure.c" code on XSAVES/XRSTORS.
> This runs things in pretty tight loops where everything is cache hot.
>
> The XSAVE instructions are monsters and I'm not super-confident in my
> measurements, but I'm seeing in the neighborhood of XSAVES/XRSTORS
> getting 20-30 cycles when PKRU is in play vs. not.  This is with
> completely cache-hot data, though.

That's surprisingly bad, albeit negligible in the grand scheme of
context switches.

But maybe we could optimize differently.  When switching states, mask
out PKRU if it matches between the two states.  This could be messy,
but, if we switch to using WRPKRU directly some day, then the init
optimization becomes moot and this optimization becomes easy.

Hmm.  If we switch to WRPKRU directly and *always* mask out PKRU, maybe
a bunch of your code gets simpler because you won't have to poke around
in the saved state.

Looking forward, I think the xstate partitions into at least three kinds:

1. Pure user state (FPU, etc)

2. User-affecting XSAVES-only state (the CET mess, etc).

3. User accessible state that is needed in user mode *and* kernel mode.
   PKRU is like this.

Type 3 state needs to be switched eagerly.  Types 1 and 2 need to be
*saved* eagerly but not necessarily loaded eagerly.  I still want to
lazy-load it some day.

Do you happen to have WRPKRU cycle counts you can share?

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 06/30/2016 10:36 AM, Andy Lutomirski wrote:
>>> We make baseline_pkru a process-wide baseline and store it in
>>> mm->context.  That way, no matter which thread gets interrupted for a
>>> signal, they see consistent values.  We only write to it when an app
>>> _specifically_ asks for it to be updated with a special flag to
>>> sys_pkey_set().
>>>
>>> When an app uses the execute-only support, we implicitly set the
>>> read-disable bit in baseline_pkru for the execute-only pkey.
...
> Looking at your git tree, which I assume is a reasonably approximation
> of your current patches, this seems to be unimplemented.  I, at least,
> would be nervous about using PKRU for protection of critical data if
> signal handlers are unconditionally exempt.

I actually went along and implemented this using an extra 'flag' for
pkey_get/set().  I just left it out of this stage since I'm having
enough problems getting it in with the existing set of features. :)

I'm confident we can add this later with the flags we can pass to
pkey_get() and pkey_set().

> Also, the lazily allocated no-read key for execute-only is done in the
> name of performance, but it results in odd semantics.  How much of a
> performance win is preserving the init optimization of PKRU in
> practice?  (I.e. how much faster are XSAVE and XRSTOR?)  I can't test
> because even my Skylake laptop doesn't have PKRU.

This is admittedly not the most realistic benchmark because everything
is cache-warm, but I ran Ingo's FPU "measure.c" code on XSAVES/XRSTORS.
This runs things in pretty tight loops where everything is cache hot.

The XSAVE instructions are monsters and I'm not super-confident in my
measurements, but I'm seeing in the neighborhood of XSAVES/XRSTORS
getting 20-30 cycles when PKRU is in play vs. not.  This is with
completely cache-hot data, though.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Mon, Dec 21, 2015 at 3:07 PM, Andy Lutomirski wrote:
> On Mon, Dec 21, 2015 at 3:04 PM, Dave Hansen
> wrote:
>> On 12/18/2015 02:28 PM, Andy Lutomirski wrote:
>> ... I could imagine that some kernel person would want to use even
>> more keys, but I think two fixed keys are kind of the minimal we'd
>> want to use.
>>>
>>> I imagine we'd reserve key 0 for normal page and key 1 for deny-read.
>>> Let me be a bit more concrete about what I'm suggesting:
>>>
>>> We'd have thread_struct.baseline_pkru.  It would start with key 0
>>> allowing all access and key 1 denying reads.
>>
>> Are you sure thread_struct is the right place for this?  I think of
>> signal handlers as a process-wide thing, and it seems a bit goofy if we
>> have the PKRU value in a signal handler depend on the PKRU of the thread
>> that got interrupted.
>
> I think you're right.  mmu_context_t might be a better choice.
>
>>
>>> We'd have a syscall like set_protection_key that could allocate unused
>>> keys and change the values of keys that have been allocated.  Those
>>> changes would be reflected in baseline_pkru.  Changes to keys 0 and 1
>>> in baseline_pkru would not be allowed.
>>
>> FWIW, I think we can do this without *actually* dedicating key 1 to
>> execute-only.  But that's a side issue.
>>
>>> Signal delivery would load baseline_pkru into the PKRU register.
>>> Signal restore would restore PKRU to its previous value.
>>
>> Do you really mean "its previous value" or are you OK with the existing
>> behavior which restores PKRU from the XSAVE buffer in the sigcontext?
>
> By "its previous value" I meant the value in the XSAVE buffer in the
> sigcontext.  So I think I'm okay with that :)
>
>>
>>> WRPKRU would, of course, override baseline_pkru, but it wouldn't
>>> change baseline_pkru.  The set_protection_key syscall would modify
>>> *both* real PKRU and baseline_pkru.
>>
>> How about this:
>>
>> We make baseline_pkru a process-wide baseline and store it in
>> mm->context.  That way, no matter which thread gets interrupted for a
>> signal, they see consistent values.  We only write to it when an app
>> _specifically_ asks for it to be updated with a special flag to
>> sys_pkey_set().
>>
>> When an app uses the execute-only support, we implicitly set the
>> read-disable bit in baseline_pkru for the execute-only pkey.
>
> Sounds good, I think.

Resurrecting an old thread, but:

Looking at your git tree, which I assume is a reasonable approximation
of your current patches, this seems to be unimplemented.  I, at least,
would be nervous about using PKRU for protection of critical data if
signal handlers are unconditionally exempt.

Also, the lazily allocated no-read key for execute-only is done in the
name of performance, but it results in odd semantics.  How much of a
performance win is preserving the init optimization of PKRU in
practice?  (I.e. how much faster are XSAVE and XRSTOR?)  I can't test
because even my Skylake laptop doesn't have PKRU.

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 01:45 PM, Linus Torvalds wrote:
> On Fri, Dec 18, 2015 at 1:12 PM, Dave Hansen wrote:
>>
>> But, if we are picking out an execute-only pkey more dynamically, we've
>> got to keep the default value for the entire process somewhere.
>
> How dynamic do we want to make this, though?

Right now, all I plan to do is make it a one-way trip: if a process does
a prot=PROT_EXEC mapping, we dedicate a key local to that process, and
it gets 14 usable keys.  If it doesn't use prot=PROT_EXEC, then it gets
15 usable keys.

> I haven't looked at the details, and perhaps more importantly, I don't
> know what exactly are the requirements you've gotten from the people
> who are expected to actually use this.
>
> I think we might want to hardcode a couple of keys as "kernel
> reserved".  And I'd rather reserve them up-front than have some user
> program be unhappy later when we want to use them.

The one constant I've heard from the folks that are going to use this is
that 15 keys is not enough.  That's why I'm hesitant to remove _any_ more.

> But I do think we might want to have that "no read access" as a real
> fixed key too, because I think the kernel itself would want to use it:
>
> (a) to make sure that it gets the right fault when user space passes
> in an execute-only address to a system call.

Having a dedicated or static key for execute-only doesn't really change
this code.  We just have one extra step in the fault code to go look in
mm->context and see which pkey (if any) is assigned to be execute-only.

> (b) for much more efficient PAGEALLOC_DEBUG for kernel mappings.

The current hardware only applies the keys on _PAGE_USER mappings, so we
can't use it for kernel mappings.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
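[Editor's note: Dave's "one-way trip" amounts to a lazily assigned per-process key. A sketch of that state machine follows; the struct, function name, and the concrete key number 15 are all hypothetical, standing in for the real kernel's allocation path:]

```c
#include <assert.h>

#define PKEY_NONE (-1)

/* Hypothetical per-process state: which pkey, if any, has been
 * permanently dedicated to execute-only mappings. */
struct mm_ctx {
	int execute_only_pkey;	/* PKEY_NONE until the first PROT_EXEC-only mmap */
};

/* One-way trip: the first PROT_EXEC-only mapping dedicates a key for
 * the life of the process (costing one of the 15 usable keys); every
 * later execute-only mapping reuses it. */
static int get_execute_only_pkey(struct mm_ctx *ctx)
{
	if (ctx->execute_only_pkey == PKEY_NONE)
		ctx->execute_only_pkey = 15;	/* would be allocated for real */
	return ctx->execute_only_pkey;
}
```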
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Mon, Dec 21, 2015 at 3:04 PM, Dave Hansen wrote:
> On 12/18/2015 02:28 PM, Andy Lutomirski wrote:
> ...
>>> I could imagine that some kernel person would want to use even more
>>> keys, but I think two fixed keys are kind of the minimal we'd want to
>>> use.
>>
>> I imagine we'd reserve key 0 for normal page and key 1 for deny-read.
>> Let me be a bit more concrete about what I'm suggesting:
>>
>> We'd have thread_struct.baseline_pkru.  It would start with key 0
>> allowing all access and key 1 denying reads.
>
> Are you sure thread_struct is the right place for this?  I think of
> signal handlers as a process-wide thing, and it seems a bit goofy if we
> have the PKRU value in a signal handler depend on the PKRU of the thread
> that got interrupted.

I think you're right.  mmu_context_t might be a better choice.

>> We'd have a syscall like set_protection_key that could allocate unused
>> keys and change the values of keys that have been allocated.  Those
>> changes would be reflected in baseline_pkru.  Changes to keys 0 and 1
>> in baseline_pkru would not be allowed.
>
> FWIW, I think we can do this without *actually* dedicating key 1 to
> execute-only.  But that's a side issue.
>
>> Signal delivery would load baseline_pkru into the PKRU register.
>> Signal restore would restore PKRU to its previous value.
>
> Do you really mean "its previous value" or are you OK with the existing
> behavior which restores PKRU from the XSAVE buffer in the sigcontext?

By "its previous value" I meant the value in the XSAVE buffer in the
sigcontext.  So I think I'm okay with that :)

>> WRPKRU would, of course, override baseline_pkru, but it wouldn't
>> change baseline_pkru.  The set_protection_key syscall would modify
>> *both* real PKRU and baseline_pkru.
>
> How about this:
>
> We make baseline_pkru a process-wide baseline and store it in
> mm->context.  That way, no matter which thread gets interrupted for a
> signal, they see consistent values.  We only write to it when an app
> _specifically_ asks for it to be updated with a special flag to
> sys_pkey_set().
>
> When an app uses the execute-only support, we implicitly set the
> read-disable bit in baseline_pkru for the execute-only pkey.

Sounds good, I think.

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/21/2015 03:02 PM, Andy Lutomirski wrote:
> On Mon, Dec 21, 2015 at 3:00 PM, Dave Hansen wrote:
>> On 12/21/2015 02:52 PM, Andy Lutomirski wrote:
>>> Perhaps this is silly, but what if the default were changed to deny
>>> reads and writes for unallocated keys?  Is there a use case that
>>> breaks?
>>
>> It's probably a reasonable debugging feature.
>>
>> But, anything that takes an XSAVE feature out of its "init state" has
>> the potential to do a bit of harm because it increases the potential
>> size of writes during XSAVE.  XSAVEOPT will _help_ here, but we probably
>> don't want to go out of our way to take things out of the init state
>> when we're unsure of the benefits.
>
> Aren't you already doing that with your magic execute-only thing?

Yep, but that's with a concrete benefit in mind.

> Also, if we ever do the deferred-xstate-restore thing that Rik was
> playing with awhile back, then we'll want to switch to using rdpkru
> and wrpkru in-kernel directly, and we'll explicitly mask PKRU out of
> the XRSTOR and XSAVEOPT state, and this particular issue will become
> irrelevant.

Yep, agreed.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 02:28 PM, Andy Lutomirski wrote:
...
>> I could imagine that some kernel person would want to use even more
>> keys, but I think two fixed keys are kind of the minimal we'd want to
>> use.
>
> I imagine we'd reserve key 0 for normal page and key 1 for deny-read.
> Let me be a bit more concrete about what I'm suggesting:
>
> We'd have thread_struct.baseline_pkru.  It would start with key 0
> allowing all access and key 1 denying reads.

Are you sure thread_struct is the right place for this?  I think of
signal handlers as a process-wide thing, and it seems a bit goofy if we
have the PKRU value in a signal handler depend on the PKRU of the thread
that got interrupted.

> We'd have a syscall like set_protection_key that could allocate unused
> keys and change the values of keys that have been allocated.  Those
> changes would be reflected in baseline_pkru.  Changes to keys 0 and 1
> in baseline_pkru would not be allowed.

FWIW, I think we can do this without *actually* dedicating key 1 to
execute-only.  But that's a side issue.

> Signal delivery would load baseline_pkru into the PKRU register.
> Signal restore would restore PKRU to its previous value.

Do you really mean "its previous value" or are you OK with the existing
behavior which restores PKRU from the XSAVE buffer in the sigcontext?

> WRPKRU would, of course, override baseline_pkru, but it wouldn't
> change baseline_pkru.  The set_protection_key syscall would modify
> *both* real PKRU and baseline_pkru.

How about this:

We make baseline_pkru a process-wide baseline and store it in
mm->context.  That way, no matter which thread gets interrupted for a
signal, they see consistent values.  We only write to it when an app
_specifically_ asks for it to be updated with a special flag to
sys_pkey_set().

When an app uses the execute-only support, we implicitly set the
read-disable bit in baseline_pkru for the execute-only pkey.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Mon, Dec 21, 2015 at 3:00 PM, Dave Hansen wrote:
> On 12/21/2015 02:52 PM, Andy Lutomirski wrote:
>> Perhaps this is silly, but what if the default were changed to deny
>> reads and writes for unallocated keys?  Is there a use case that
>> breaks?
>
> It's probably a reasonable debugging feature.
>
> But, anything that takes an XSAVE feature out of its "init state" has
> the potential to do a bit of harm because it increases the potential
> size of writes during XSAVE.  XSAVEOPT will _help_ here, but we probably
> don't want to go out of our way to take things out of the init state
> when we're unsure of the benefits.

Aren't you already doing that with your magic execute-only thing?

Also, if we ever do the deferred-xstate-restore thing that Rik was
playing with awhile back, then we'll want to switch to using rdpkru
and wrpkru in-kernel directly, and we'll explicitly mask PKRU out of
the XRSTOR and XSAVEOPT state, and this particular issue will become
irrelevant.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/21/2015 02:52 PM, Andy Lutomirski wrote:
> Perhaps this is silly, but what if the default were changed to deny
> reads and writes for unallocated keys?  Is there a use case that
> breaks?

It's probably a reasonable debugging feature.

But, anything that takes an XSAVE feature out of its "init state" has
the potential to do a bit of harm because it increases the potential
size of writes during XSAVE.  XSAVEOPT will _help_ here, but we probably
don't want to go out of our way to take things out of the init state
when we're unsure of the benefits.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Dec 22, 2015 2:04 AM, "Dave Hansen" wrote:
>
> On 12/18/2015 03:16 PM, Andy Lutomirski wrote:
> > Hrm.  We might also want an option to change pkru and/or baseline_pkru
> > in all threads in the current mm.  That's optional but it could be
> > handy.  Maybe it would be as simple as having the allocate-a-pkey call
> > have an option to set an initial baseline value and an option to
> > propagate that initial value to pre-existing threads.
>
> Do you mean actively going in and changing PKRU in other threads?  I
> fear that will be dangerous.
>
> IMNHO, whatever we do, I think we need to ensure that _raw_ PKRU calls
> are allowed (somehow).  Raw in this case would mean a thread calling
> WRPKRU without a system call and without checking in with what any other
> threads are doing.
>
> Let's say baseline_pkru=0x004 (we're access-disabling PKEY[1] and using
> it for execute-only).  Now, a thread is trying to do this:
>
>	pkey2 = sys_pkey_alloc(); // now pkey2=2
>	tmp = rdpkru();           // 0x004
>	tmp |= 0x10;              // set PKRU[2].AD=1
>	wrpkru(tmp);
>
> While another thread does:
>
>	pkey4 = pkey_alloc(); // pkey4=4
>	sys_pkey_set(pkey4, ACCESS_DISABLE, SET_BASELINE_ALL_THREADS);
>
> Without some kind of locking, that's going to race.  We could do all the
> locking in the kernel, but that requires that the kernel do all the PKRU
> writing, which I'd really like to avoid.
>
> I think the closest we can get reasonably is to have the kernel track
> the baseline_pkru and then allow userspace to query it in case userspace
> decides that thread needs to update its thread-local PKRU from the
> baseline.

Yeah, fair point.  Let's skip the modify-other-threads thing.

Perhaps this is silly, but what if the default were changed to deny
reads and writes for unallocated keys?  Is there a use case that
breaks?

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 03:16 PM, Andy Lutomirski wrote:
> Hrm.  We might also want an option to change pkru and/or baseline_pkru
> in all threads in the current mm.  That's optional but it could be
> handy.  Maybe it would be as simple as having the allocate-a-pkey call
> have an option to set an initial baseline value and an option to
> propagate that initial value to pre-existing threads.

Do you mean actively going in and changing PKRU in other threads?  I
fear that will be dangerous.

IMNHO, whatever we do, I think we need to ensure that _raw_ PKRU calls
are allowed (somehow).  Raw in this case would mean a thread calling
WRPKRU without a system call and without checking in with what any other
threads are doing.

Let's say baseline_pkru=0x004 (we're access-disabling PKEY[1] and using
it for execute-only).  Now, a thread is trying to do this:

	pkey2 = sys_pkey_alloc(); // now pkey2=2
	tmp = rdpkru();           // 0x004
	tmp |= 0x10;              // set PKRU[2].AD=1
	wrpkru(tmp);

While another thread does:

	pkey4 = pkey_alloc(); // pkey4=4
	sys_pkey_set(pkey4, ACCESS_DISABLE, SET_BASELINE_ALL_THREADS);

Without some kind of locking, that's going to race.  We could do all the
locking in the kernel, but that requires that the kernel do all the PKRU
writing, which I'd really like to avoid.

I think the closest we can get reasonably is to have the kernel track
the baseline_pkru and then allow userspace to query it in case userspace
decides that thread needs to update its thread-local PKRU from the
baseline.
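[Editor's note: the race Dave describes is a classic lost update on a read-modify-write of the PKRU register. Replaying the two threads' steps in one deterministic interleaving makes it concrete; PKRU is simulated with a plain variable here since WRPKRU needs pkeys hardware:]

```c
#include <assert.h>
#include <stdint.h>

#define PKRU_AD_BIT(pkey) (1u << ((pkey) * 2))

/* Replays the interleaving from the mail above with a simulated PKRU. */
static uint32_t race_interleaving(void)
{
	uint32_t pkru = PKRU_AD_BIT(1);	/* baseline_pkru = 0x004 */

	/* Thread A begins its read-modify-write to disable key 2: */
	uint32_t tmp = pkru;		/* rdpkru() -> 0x004 */

	/* Before thread A writes back, the kernel (acting on the other
	 * thread's SET_BASELINE_ALL_THREADS request) disables key 4 in
	 * this thread's PKRU as well: */
	pkru |= PKRU_AD_BIT(4);		/* pkru -> 0x104 */

	/* Thread A writes back its stale value plus its own bit, and
	 * key 4's access-disable bit is silently clobbered: */
	pkru = tmp | PKRU_AD_BIT(2);	/* wrpkru(0x014) */

	return pkru;
}
```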
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 3:16 PM, Andy Lutomirski wrote:
>
> Yes, I think.  If I'm using protection keys to protect some critical
> data structure (important stuff in shared memory, important memory
> mapped files, pmem, etc), then I'll allocate a protection key and set
> PKRU to deny writes.  The problem is that I really, really want writes
> denied except when explicitly enabled in narrow regions of code that
> use wrpkru to enable them, and I don't want an asynchronous signal
> delivered in those narrow regions of code or newly cloned threads to
> pick up the write-allow value.  So I want baseline_pkru to have the
> deny writes entry.

Hmm.  Ok, that does sound like a valid and interesting usage case.
Fair enough.

             Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 3:08 PM, Linus Torvalds wrote:
> On Fri, Dec 18, 2015 at 2:28 PM, Andy Lutomirski wrote:
>>
>> Apps that don't want to use the baseline_pkru mechanism could use
>> syscalls to claim ownership of protection keys but then manage them
>> purely with WRPKRU directly.  We could optionally disallow
>> mprotect_key on keys that weren't allocated in advance.
>>
>> Does that seem sane?
>
> So everything seems sane except for the need for that baseline_pkru.
>
> I'm not seeing why it couldn't just be a fixed value.  Is there any
> real downside to it?

Yes, I think.  If I'm using protection keys to protect some critical
data structure (important stuff in shared memory, important memory
mapped files, pmem, etc), then I'll allocate a protection key and set
PKRU to deny writes.  The problem is that I really, really want writes
denied except when explicitly enabled in narrow regions of code that
use wrpkru to enable them, and I don't want an asynchronous signal
delivered in those narrow regions of code or newly cloned threads to
pick up the write-allow value.  So I want baseline_pkru to have the
deny writes entry.

I think I would do exactly this in my production code here if my server
supported it.  Some day...

Hrm.  We might also want an option to change pkru and/or baseline_pkru
in all threads in the current mm.  That's optional but it could be
handy.  Maybe it would be as simple as having the allocate-a-pkey call
have an option to set an initial baseline value and an option to
propagate that initial value to pre-existing threads.

--Andy
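[Editor's note: Andy's "narrow write window" pattern looks roughly like the sketch below. The PKRU register is simulated with a variable, since WRPKRU requires pkeys hardware, and the helper names (`open_write_window`, `update_protected`, etc.) are hypothetical illustrations, not an existing API:]

```c
#include <assert.h>
#include <stdint.h>

#define PKRU_WD_BIT(pkey) (1u << ((pkey) * 2 + 1))

/* Simulated per-thread PKRU register; real code would issue
 * RDPKRU/WRPKRU here. */
static uint32_t fake_pkru;

static void open_write_window(int pkey)  { fake_pkru &= ~PKRU_WD_BIT(pkey); }
static void close_write_window(int pkey) { fake_pkru |=  PKRU_WD_BIT(pkey); }

/* The pattern from the mail: data tagged with `pkey` is writable only
 * inside this narrow region; the worry in the thread is a signal or a
 * clone() landing between open and close. */
static int update_protected(int pkey, int *protected_val, int new_val)
{
	open_write_window(pkey);
	*protected_val = new_val;	/* would fault outside the window */
	close_write_window(pkey);
	/* report whether writes are denied again on exit */
	return (fake_pkru & PKRU_WD_BIT(pkey)) != 0;
}
```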
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 2:28 PM, Andy Lutomirski wrote: > > Apps that don't want to use the baseline_pkru mechanism could use > syscalls to claim ownership of protection keys but then manage them > purely with WRPKRU directly. We could optionally disallow > mprotect_key on keys that weren't allocated in advance. > > Does that seem sane? So everything seems sane except for the need for that baseline_pkru. I'm not seeing why it couldn't just be a fixed value. Is there any real downside to it? Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 1:45 PM, Linus Torvalds wrote: > On Fri, Dec 18, 2015 at 1:12 PM, Dave Hansen > wrote: >> >> But, if we are picking out an execute-only pkey more dynamically, we've >> got to keep the default value for the entire process somewhere. > > How dynamic do we want to make this, though? > > I haven't looked at the details, and perhaps more importantly, I don't > know what exactly are the requirements you've gotten from the people > who are expected to actually use this. > > I think we might want to hardcode a couple of keys as "kernel > reserved". And I'd rather reserve them up-front than have some user > program be unhappy later when we want to use them. > > I guess we want to leave key #0 for "normal page", so my suggesting to > use that for the execute-only was probably misguided. > > But I do think we might want to have that "no read access" as a real > fixed key too, because I think the kernel itself would want to use it: > > (a) to make sure that it gets the right fault when user space passes > in a execute-only address to a system call. > > (b) for much more efficient PAGEALLOC_DEBUG for kernel mappings. > > so I do think that we'd want to reserve two of the 16 keys up front. > > Would it be ok for the expected users to have those keys simply be > fixed? With key 0 being used for all default pages, and key 1 being > used for all execute-only pages? And then defaulting PKRU to 4, > disallowing access to that key #1? > > I could imagine that some kernel person would want to use even more > keys, but I think two fixed keys are kind of the minimal we'd want to > use. I imagine we'd reserve key 0 for normal page and key 1 for deny-read. Let me be a bit more concrete about what I'm suggesting: We'd have thread_struct.baseline_pkru. It would start with key 0 allowing all access and key 1 denying reads. We'd have a syscall like set_protection_key that could allocate unused keys and change the values of keys that have been allocated. 
Those changes would be reflected in baseline_pkru. Changes to keys 0 and 1 in baseline_pkru would not be allowed. Signal delivery would load baseline_pkru into the PKRU register. Signal restore would restore PKRU to its previous value. WRPKRU would, of course, override baseline_pkru, but it wouldn't change baseline_pkru. The set_protection_key syscall would modify *both* real PKRU and baseline_pkru. Apps that don't want to use the baseline_pkru mechanism could use syscalls to claim ownership of protection keys but then manage them purely with WRPKRU directly. We could optionally disallow mprotect_key on keys that weren't allocated in advance. Does that seem sane? --Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 1:12 PM, Dave Hansen wrote: > > But, if we are picking out an execute-only pkey more dynamically, we've > got to keep the default value for the entire process somewhere. How dynamic do we want to make this, though? I haven't looked at the details, and perhaps more importantly, I don't know what exactly are the requirements you've gotten from the people who are expected to actually use this. I think we might want to hardcode a couple of keys as "kernel reserved". And I'd rather reserve them up-front than have some user program be unhappy later when we want to use them. I guess we want to leave key #0 for "normal page", so my suggestion to use that for the execute-only was probably misguided. But I do think we might want to have that "no read access" as a real fixed key too, because I think the kernel itself would want to use it: (a) to make sure that it gets the right fault when user space passes in an execute-only address to a system call. (b) for much more efficient PAGEALLOC_DEBUG for kernel mappings. so I do think that we'd want to reserve two of the 16 keys up front. Would it be ok for the expected users to have those keys simply be fixed? With key 0 being used for all default pages, and key 1 being used for all execute-only pages? And then defaulting PKRU to 4, disallowing access to that key #1? I could imagine that some kernel person would want to use even more keys, but I think two fixed keys are kind of the minimum we'd want to use. Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 01:04 PM, Linus Torvalds wrote: > On Fri, Dec 18, 2015 at 12:49 PM, Andy Lutomirski wrote: >> >> IOW, I like my idea in which signal delivery always sets PKRU to the >> application-requested-by-syscall values and sigreturn restores it. > > So I don't mind that, as long as the whole "sigreturn restores it" is > part of things. > > Your original email with the suggestion to *not* restore PKRU I didn't > like. Setting it and restoring it is fine. > > I do wonder if you need an explicit value, though. I think it's > reasonable to say that PKRU value 0 is special. It's what we'd start > processes with, and why not just say that it's what we run signal > handlers in? > > Would any other value ever make sense, really? Having a PKRU with the execute-only permissions set is the only one I can think of. For a system with a _dedicated_ PKEY for execute-only, this is easy and could even be made a part of init_fpstate with no other code changes. But, if we are picking out an execute-only pkey more dynamically, we've got to keep the default value for the entire process somewhere.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 1:04 PM, Linus Torvalds wrote: > > I do wonder if you need an explicit value, though. I think it's > reasonable to say that PKRU value 0 is special. It's what we'd start > processes with, and why not just say that it's what we run signal > handlers in? > > Would any other value ever make sense, really? Ahh. Your point about the PROT_EXEC handling means that maybe we don't want to default to zero. Maybe we want to make the default PKRU startup value be 1 instead, setting the access-disable bit for key 0? Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 01:02 PM, Andy Lutomirski wrote: > On Fri, Dec 18, 2015 at 12:58 PM, H. Peter Anvin wrote: >> > On 12/18/2015 12:49 PM, Andy Lutomirski wrote: >>> >> >>> >> IOW, I like my idea in which signal delivery always sets PKRU to the >>> >> application-requested-by-syscall values and sigreturn restores it. >>> >> Kinda like sigaltstack, but applies to all signals and affects PKRU >>> >> instead of RSP. >>> >> >> > I think this is the only sensible option, with the default being all zero. >> > > Or not quite all zero if we do Dave's experimental PROT_EXEC thing. > > Actually, I want to introduce a set of per-mm "incompatible" bits. By > default, they'd be zero. We can, as needed, define bits that do > something nice but break old code. I want one of the bits to turn > vsyscalls off entirely. Another bit could say that the kernel is > allowed to steal a protection key for PROT_EXEC. That really only makes sense if we have userspace that expects all the protection keys to be available. If we go the route of having pkey_alloc/free() syscalls, then the kernel can easily tell userspace to keep its mitts off a particular pkey by not ever returning it from a pkey_alloc().
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:49 PM, Andy Lutomirski wrote: > > IOW, I like my idea in which signal delivery always sets PKRU to the > application-requested-by-syscall values and sigreturn restores it. So I don't mind that, as long as the whole "sigreturn restores it" is part of things. Your original email with the suggestion to *not* restore PKRU I didn't like. Setting it and restoring it is fine. I do wonder if you need an explicit value, though. I think it's reasonable to say that PKRU value 0 is special. It's what we'd start processes with, and why not just say that it's what we run signal handlers in? Would any other value ever make sense, really? Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:58 PM, H. Peter Anvin wrote: > On 12/18/2015 12:49 PM, Andy Lutomirski wrote: >> >> IOW, I like my idea in which signal delivery always sets PKRU to the >> application-requested-by-syscall values and sigreturn restores it. >> Kinda like sigaltstack, but applies to all signals and affects PKRU >> instead of RSP. >> > > I think this is the only sensible option, with the default being all zero. > Or not quite all zero if we do Dave's experimental PROT_EXEC thing. Actually, I want to introduce a set of per-mm "incompatible" bits. By default, they'd be zero. We can, as needed, define bits that do something nice but break old code. I want one of the bits to turn vsyscalls off entirely. Another bit could say that the kernel is allowed to steal a protection key for PROT_EXEC. These bits would be read and written by prctl, but there could also be an ELF note mechanism to initialize them on execve without a syscall. --Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 12:49 PM, Andy Lutomirski wrote: > > IOW, I like my idea in which signal delivery always sets PKRU to the > application-requested-by-syscall values and sigreturn restores it. > Kinda like sigaltstack, but applies to all signals and affects PKRU > instead of RSP. > I think this is the only sensible option, with the default being all zero. -hpa
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:37 PM, Linus Torvalds wrote: > On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen > wrote: >> >> But, I think it's a small one. Basically, RSP would have to pointing at >> a place which was allowed by protection keys for all of the sigframe >> setup. > > Note that the whole "stack is special" is not at all a new issue. > > It's the main reason why sigaltstack() and SS_ONSTACK exists. This is > in no way new to PKRU, people have had to handle the issue of > stack-related SIGSEGV faults for a long time. > > So any application that uses PKRU and may play games that affects the > stack, will always have to have a separate "safe stack" that it uses > for signal handling. But that is in no way PKRU-specific, it's been > the case for a lot of other memory management faults. > The trouble is that this isn't just limited to special "safe stack" SA_ONSTACK aware code. It even affects normally innocuous things. The thing that's new is that, before PKRU, the HW stack might have been unsafe in the sense that accessing it would fault and we need a different backup stack for some signals. With PKRU, we might be in a transient extra-privileged (permissive PKRU) state, and we want to tighten permissions for *all* signal deliveries so that, in case the current stack is erroneous, we don't corrupt it. Admittedly, ending up with permissive PKRU in some short sequence and corrupt RSP is likely to corrupt things no matter what the kernel does. But I don't think we want to deliver SIGALRM to a library while the main program is in a specially permissive PKRU region. IOW, I like my idea in which signal delivery always sets PKRU to the application-requested-by-syscall values and sigreturn restores it. Kinda like sigaltstack, but applies to all signals and affects PKRU instead of RSP. 
--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen wrote: > > But, I think it's a small one. Basically, RSP would have to be pointing at > a place which was allowed by protection keys for all of the sigframe > setup. Note that the whole "stack is special" is not at all a new issue. It's the main reason why sigaltstack() and SS_ONSTACK exist. This is in no way new to PKRU, people have had to handle the issue of stack-related SIGSEGV faults for a long time. So any application that uses PKRU and may play games that affect the stack will always have to have a separate "safe stack" that it uses for signal handling. But that is in no way PKRU-specific, it's been the case for a lot of other memory management faults. Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen wrote: > On 12/18/2015 11:21 AM, Andy Lutomirski wrote: >> On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen >> wrote: >>> On 12/18/2015 08:04 AM, Andy Lutomirski wrote: 1b. If the app malfunctions such that RSP points to pmem, the kernel MUST NOT clobber the pmem space. I think that this basically mandates that PKRU needs to have some safe state (i.e. definitely not the init state) on signal delivery: the kernel is going to write a signal frame at the address identified by RSP, and that address is in pmem, so those writes need to fail. >>> >>> The kernel is writing the signal frame using normal old copy_to_user(). >>> Those are writing through mappings with _PAGE_USER set and should be >>> subject to the PKRU state of the thread before the signal started to be >>> delivered. >>> >>> We don't do the fpu__clear() until after this copy, so I think pkeys >>> enforcement is being done properly for this today. >> >> True, but I think only in a very limited sense. Your average signal >> handler is reasonably like to execute "push $rbp" as its very first >> instruction, at which point we're immediately screwed with the current >> arrangement. > > I completely agree that there's a window for corruption. > > But, I think it's a small one. Basically, RSP would have to pointing at > a place which was allowed by protection keys for all of the sigframe > setup. Then, _just_ happened to be at a place which was denied by > protection keys when it enters the signal handler back in userspace. > It's possible, but it's a small window. That's true. OTOH, I think that the signal in the middle of the wrpkru-allowed window is also a compelling case. So maybe my new preference is option B, where we save and restore PKRU like all the other xfeatures but, rather than clearing it in the signal context, we set it to the syscall-specified value. 
If we do this, then we'd have to clearly separate the syscall-specified value from the xstate value, and we'd probably want one of the syscalls to offer an option to force user PKRU to match the syscall value. Maybe your code already does this -- I didn't check. This has the nice property that sigreturn isn't affected at all except insofar as we should give a little bit of thought to the ordering of restoring PKRU relative to any other uaccess. I'd imagine that we'd just restore PKRU last, in case the stack we're restoring from has access disallowed by the saved PKRU value. --Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 11:21 AM, Andy Lutomirski wrote: > On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen > wrote: >> On 12/18/2015 08:04 AM, Andy Lutomirski wrote: >>> 1b. If the app malfunctions such that RSP points to pmem, the kernel >>> MUST NOT clobber the pmem space. I think that this basically mandates >>> that PKRU needs to have some safe state (i.e. definitely not the init >>> state) on signal delivery: the kernel is going to write a signal frame >>> at the address identified by RSP, and that address is in pmem, so >>> those writes need to fail. >> >> The kernel is writing the signal frame using normal old copy_to_user(). >> Those are writing through mappings with _PAGE_USER set and should be >> subject to the PKRU state of the thread before the signal started to be >> delivered. >> >> We don't do the fpu__clear() until after this copy, so I think pkeys >> enforcement is being done properly for this today. > > True, but I think only in a very limited sense. Your average signal > handler is reasonably likely to execute "push %rbp" as its very first > instruction, at which point we're immediately screwed with the current > arrangement. I completely agree that there's a window for corruption. But, I think it's a small one. Basically, RSP would have to be pointing at a place which was allowed by protection keys for all of the sigframe setup, then _just_ happen to be at a place which was denied by protection keys when execution enters the signal handler back in userspace. It's possible, but it's a small window.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen wrote: > On 12/18/2015 08:04 AM, Andy Lutomirski wrote: >> 1b. If the app malfunctions such that RSP points to pmem, the kernel >> MUST NOT clobber the pmem space. I think that this basically mandates >> that PKRU needs to have some safe state (i.e. definitely not the init >> state) on signal delivery: the kernel is going to write a signal frame >> at the address identified by RSP, and that address is in pmem, so >> those writes need to fail. > > The kernel is writing the signal frame using normal old copy_to_user(). > Those are writing through mappings with _PAGE_USER set and should be > subject to the PKRU state of the thread before the signal started to be > delivered. > > We don't do the fpu__clear() until after this copy, so I think pkeys > enforcement is being done properly for this today. True, but I think only in a very limited sense. Your average signal handler is reasonably likely to execute "push %rbp" as its very first instruction, at which point we're immediately screwed with the current arrangement. --Andy -- Andy Lutomirski AMA Capital Management, LLC
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 08:04 AM, Andy Lutomirski wrote: > 1b. If the app malfunctions such that RSP points to pmem, the kernel > MUST NOT clobber the pmem space. I think that this basically mandates > that PKRU needs to have some safe state (i.e. definitely not the init > state) on signal delivery: the kernel is going to write a signal frame > at the address identified by RSP, and that address is in pmem, so > those writes need to fail. The kernel is writing the signal frame using normal old copy_to_user(). Those are writing through mappings with _PAGE_USER set and should be subject to the PKRU state of the thread before the signal started to be delivered. We don't do the fpu__clear() until after this copy, so I think pkeys enforcement is being done properly for this today.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 08:04 AM, Andy Lutomirski wrote: > Which reminds me: __get_user, etc all respect PKRU because the SDM > says that PKRU affects all user *and* kernel accesses to user memory > (i.e. anything with _PAGE_USER). Should get_user_pages also respect > PKRU? It does. There are a couple of parts to it, but the most important are in these patches: > http://git.kernel.org/cgit/linux/kernel/git/daveh/x86-pkeys.git/commit/?h=pkeys-v018&id=c64acbe036487d07968ea6a5b9090169d6e40160 > http://git.kernel.org/cgit/linux/kernel/git/daveh/x86-pkeys.git/commit/?h=pkeys-v018&id=331c62eb789f969e187964e28ac80b1e0e26b69d
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Thu, Dec 17, 2015 at 10:43 PM, H. Peter Anvin wrote: > On December 17, 2015 9:29:21 PM PST, Andy Lutomirski > wrote: >>On Dec 17, 2015 6:53 PM, "Dave Hansen" >>wrote: >>> >>> On 12/17/2015 06:32 PM, Andy Lutomirski wrote: >>> > On Thu, Dec 17, 2015 at 6:13 PM, Dave Hansen >> wrote: >>> >> But what about the register state when delivering a signal? Don't >>we >>> >> set the registers to the init state? Do we need to preserve PKRU >>state >>> >> instead of init'ing it? The init state _is_ nice here because it >>is >>> >> permissive allows us to do something useful no matter what PKRU >>gets set to. >>> > >>> > I think we leave the extended regs alone. Don't we? >>> > >>> > I think that, for most signals, we want to leave PKRU as is, >>> > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we >>want >>> > an option to reset PKRU on delivery (and then set the flag to >>restore >>> > on return?). >>> >>> Is there some precedent for doing the state differently for different >>> signals? >> >>Yes, to a very limited extent: SA_ONSTACK. >> >>> >>> >> Well, the signal handler isn't necessarily going to clobber it, >>but the >>> >> delivery code already clobbers it when going to the init state. >>> > >>> > Can you point to that code? >>> >>> handle_signal() -> fpu__clear() >>> >>> The comment around it says: >>> >>> "Ensure the signal handler starts with the new fpu state." >>> >> >>You win this round :) >> >>So maybe we should have a mask of xfeatures that aren't cleared on >>signal delivery (e.g. PKRU, perhaps) and that are, by default, >>preserved across signal returns. >> [...] > > I find the notion of PKRU not being initialized at the entry to a signal > handler a bit odd. Saving/restoring it seems like the right thing to me. > > What am I missing? 
On Fri, Dec 18, 2015 at 12:32 AM, Ingo Molnar wrote: > > So the principle is this: signals are conceptually like creating a new thread > of > execution, and that's a very powerful programming concept, like fork() or > pthread_create() are powerful concepts. So we definitely want to keep that > default > behavior, and I don't think we want to deviate from that for typical new > extended > CPU context features, even if signal delivery slows down as a result. I think that PKRU is in almost no respect a typical new extended CPU context feature. It's a control register, not a data register. Let's try a concrete example. One big use case for PKRU will be in conjunction with DAX-style pmem. Imagine some program (database server, perhaps) that mmaps 1 TB of pmem MAP_SHARED, PROT_READ | PROT_WRITE with protection key 1. It sets PKRU so that key 1 can't be written. All of its accesses to the pmem store are in tight code regions that do wrpkru; mov; wrpkru (just like what the kernel does with stac/clac on SMAP systems). From the app's perspective, it's imperative that stray writes that coincidentally hit the pmem region fail. (1 TB is big enough that a random write to canonical user memory hits it almost 1% of the time.) This means: 1. If a signal is delivered, the app wants PKRU to be set to some safe value as early as possible. The app can do it by itself if it doesn't use libraries that interfere, but it would be IMO much, much nicer if the kernel made this happen automatically. 1b. If the app malfunctions such that RSP points to pmem, the kernel MUST NOT clobber the pmem space. I think that this basically mandates that PKRU needs to have some safe state (i.e. definitely not the init state) on signal delivery: the kernel is going to write a signal frame at the address identified by RSP, and that address is in pmem, so those writes need to fail. 2. clone with shared VM should preserve PKRU. So in this regard, I think signals are kind of like new threads after all.
Now that I think about it more, there are at least two possibilities that work for this use case. Option A: signal delivery preserves PKRU. If we go this route, I think we should decide whether PKRU is preserved on exit. Given that we're talking about adding syscalls to adjust PKRU, it seems a bit odd to me that sigreturn would, by default, undo whatever those syscalls do. Option B: signal delivery resets PKRU to user-specified values. We're talking about having syscalls to write PKRU. We could have clone and signal entry use the values set by syscall instead of the values in the actual PKRU register. That way:

  set_protection_key(1, no writes);
  ...
  wrpkru(allow writes to key 1)
  mov something          <<< fault or async signal here
  wrpkru(disallow writes to key 1)

leaves key 1 protected in the signal handler. If we go that route, I think we must restore PKRU just like any other xstate on sigreturn, so in some sense it's the simplest of all to implement. We'll need to change the signal delivery stuff to do the fpu__clear and the automatic PKRU write before trying to set up the signal frame so that the frame is written with the syscall-specified protections.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:59:14AM -0800, Christoph Hellwig wrote: > Stupid question, but what the heck is PKRU? A grep of the kernel tree > shows no results, and a web search returns mostly Thai language results. That should explain it: https://lkml.kernel.org/r/20151214190542.39c48...@viggo.jf.intel.com -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Thu, Dec 17, 2015 at 05:48:56PM -0800, Andy Lutomirski wrote: > Hi all- > > I think that, for PKRU in particular, we want the default signal > handling behavior to be a bit unusual. Stupid question, but what the heck is PKRU? A grep of the kernel tree shows no results, and a web search returns mostly Thai language results.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
* Andy Lutomirski wrote: > > >> But what about the register state when delivering a signal? Don't we > > >> set the registers to the init state? Do we need to preserve PKRU state > > >> instead of init'ing it? The init state _is_ nice here because it is > > >> permissive allows us to do something useful no matter what PKRU gets set > > >> to. > > > > > > I think we leave the extended regs alone. Don't we? > > > > > > I think that, for most signals, we want to leave PKRU as is, > > > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we want > > > an option to reset PKRU on delivery (and then set the flag to restore > > > on return?). > > > > Is there some precedent for doing the state differently for different > > signals? > > Yes, to a very limited extent: SA_ONSTACK. > > > > > >> Well, the signal handler isn't necessarily going to clobber it, but the > > >> delivery code already clobbers it when going to the init state. > > > > > > Can you point to that code? > > > > handle_signal() -> fpu__clear() > > > > The comment around it says: > > > > "Ensure the signal handler starts with the new fpu state." > > > > You win this round :) > > So maybe we should have a mask of xfeatures that aren't cleared on > signal delivery (e.g. PKRU, perhaps) and that are, by default, > preserved across signal returns. So the principle is this: signals are conceptually like creating a new thread of execution, and that's a very powerful programming concept, like fork() or pthread_create() are powerful concepts. So we definitely want to keep that default behavior, and I don't think we want to deviate from that for typical new extended CPU context features, even if signal delivery slows down as a result. But we've been arguing about 'lightweight signals' for up to two decades that I can remember. (The first such suggestion was to not save the FPU state, back when FPU saves were ridiculously slow compared to other parts of saving/restoring a context.) 
So having a well-enumerated, extensible opt-in mask (which defaults to 'all state saved') that allows smart signal handlers to skip the save/restore of certain CPU context components would be acceptable I think. But I'd still expect this to be limited to tightly coded, specialized signal handlers - as the trend goes against this type of specialization: compilers and runtime environments do take advantage of new CPU features, so if you want to have an 'easy to use' signal handler, you should use the default one. I'd not be surprised if large-scale signal users like Valgrind could benefit. > > I'm sure we can preserve it, we just need to be _careful_. > > Right. > > How much does XSAVEOPT help here? IOW if we're careful to save to the same > place we restored from and we don't modify the state in the mean time, how > much > of the time do we save? In the best case, I guess we save the memory writes > but > not the reads? So I'd not design new signal interfaces around current behavior, I'd design them for the existing patterns (which center around programming ease of use) - with opt-in, performance-enhancing specializations. Thanks, Ingo
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Thu, Dec 17, 2015 at 05:48:56PM -0800, Andy Lutomirski wrote:
> Hi all-
>
> I think that, for PKRU in particular, we want the default signal
> handling behavior to be a bit unusual.

Stupid question, but what the heck is PKRU? A grep of the kernel tree
shows no results, and a web search returns mostly Thai language results.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
* Andy Lutomirski <l...@amacapital.net> wrote:

> > >> But what about the register state when delivering a signal? Don't we
> > >> set the registers to the init state? Do we need to preserve PKRU state
> > >> instead of init'ing it? The init state _is_ nice here because it is
> > >> permissive and allows us to do something useful no matter what PKRU
> > >> gets set to.
> > >
> > > I think we leave the extended regs alone. Don't we?
> > >
> > > I think that, for most signals, we want to leave PKRU as is,
> > > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we want
> > > an option to reset PKRU on delivery (and then set the flag to restore
> > > on return?).
> >
> > Is there some precedent for doing the state differently for different
> > signals?
>
> Yes, to a very limited extent: SA_ONSTACK.
>
> > >> Well, the signal handler isn't necessarily going to clobber it, but the
> > >> delivery code already clobbers it when going to the init state.
> > >
> > > Can you point to that code?
> >
> > handle_signal() -> fpu__clear()
> >
> > The comment around it says:
> >
> > "Ensure the signal handler starts with the new fpu state."
>
> You win this round :)
>
> So maybe we should have a mask of xfeatures that aren't cleared on
> signal delivery (e.g. PKRU, perhaps) and that are, by default,
> preserved across signal returns.

So the principle is this: signals are conceptually like creating a new thread
of execution, and that's a very powerful programming concept, like fork() or
pthread_create() are powerful concepts. So we definitely want to keep that
default behavior, and I don't think we want to deviate from that for typical
new extended CPU context features, even if signal delivery slows down as a
result.

But we've been arguing about 'lightweight signals' for up to two decades that
I can remember. (The first such suggestion was to not save the FPU state, back
when FPU saves were ridiculously slow compared to other parts of
saving/restoring a context.)

So having a well-enumerated, extensible opt-in mask (which defaults to 'all
state saved') that allows smart signal handlers to skip the save/restore of
certain CPU context components would be acceptable, I think. But I'd still
expect this to be limited to closely coded, specialized signal handlers - as
the trend goes against this type of specialization: compilers and runtime
environments do take advantage of new CPU features, so if you want an 'easy to
use' signal handler, you should use the default one. I'd not be surprised if
large-scale signal users like Valgrind could benefit, though.

> > I'm sure we can preserve it, we just need to be _careful_.
>
> Right.
>
> How much does XSAVEOPT help here? IOW if we're careful to save to the same
> place we restored from and we don't modify the state in the mean time, how
> much of the time do we save? In the best case, I guess we save the memory
> writes but not the reads?

So I'd not design new signal interfaces around current behavior, I'd design
them for the existing patterns (which center around programming ease of use) -
with opt-in, performance-enhancing specializations.

Thanks,

	Ingo
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:59:14AM -0800, Christoph Hellwig wrote:
> Stupid question, but what the heck is PKRU? A grep of the kernel tree
> shows no results, and a web search returns mostly Thai language results.

That should explain it:

https://lkml.kernel.org/r/20151214190542.39c48...@viggo.jf.intel.com

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Thu, Dec 17, 2015 at 10:43 PM, H. Peter Anvin wrote:
> On December 17, 2015 9:29:21 PM PST, Andy Lutomirski wrote:
>> On Dec 17, 2015 6:53 PM, "Dave Hansen" wrote:
>>>
>>> On 12/17/2015 06:32 PM, Andy Lutomirski wrote:
>>> > On Thu, Dec 17, 2015 at 6:13 PM, Dave Hansen wrote:
>>> >> But what about the register state when delivering a signal? Don't we
>>> >> set the registers to the init state? Do we need to preserve PKRU state
>>> >> instead of init'ing it? The init state _is_ nice here because it is
>>> >> permissive and allows us to do something useful no matter what PKRU
>>> >> gets set to.
>>> >
>>> > I think we leave the extended regs alone. Don't we?
>>> >
>>> > I think that, for most signals, we want to leave PKRU as is,
>>> > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we want
>>> > an option to reset PKRU on delivery (and then set the flag to restore
>>> > on return?).
>>>
>>> Is there some precedent for doing the state differently for different
>>> signals?
>>
>> Yes, to a very limited extent: SA_ONSTACK.
>>
>>> >> Well, the signal handler isn't necessarily going to clobber it, but
>>> >> the delivery code already clobbers it when going to the init state.
>>> >
>>> > Can you point to that code?
>>>
>>> handle_signal() -> fpu__clear()
>>>
>>> The comment around it says:
>>>
>>> "Ensure the signal handler starts with the new fpu state."
>>
>> You win this round :)
>>
>> So maybe we should have a mask of xfeatures that aren't cleared on
>> signal delivery (e.g. PKRU, perhaps) and that are, by default,
>> preserved across signal returns.
>> [...]
>
> I find the notion of PKRU not being initialized at the entry to a signal
> handler a bit odd. Saving/restoring it seems like the right thing to me.
>
> What am I missing?
On Fri, Dec 18, 2015 at 12:32 AM, Ingo Molnar wrote:
>
> So the principle is this: signals are conceptually like creating a new
> thread of execution, and that's a very powerful programming concept, like
> fork() or pthread_create() are powerful concepts. So we definitely want to
> keep that default behavior, and I don't think we want to deviate from that
> for typical new extended CPU context features, even if signal delivery
> slows down as a result.

I think that PKRU is in almost no respect a typical new extended CPU
context feature. It's a control register, not a data register.

Let's try a concrete example. One big use case for PKRU will be in
conjunction with DAX-style pmem. Imagine some program (database server,
perhaps) that mmaps 1 TB of pmem MAP_SHARED, PROT_READ | PROT_WRITE with
protection key 1. It sets PKRU so that key 1 can't be written. All of its
accesses to the pmem store are in tight code regions that do wrpkru; mov;
wrpkru (just like what the kernel does with stac/clac on SMAP systems).

From the app's perspective, it's imperative that stray writes that
coincidentally hit the pmem region fail. (1 TB is big enough that a random
write to canonical user memory hits it almost 1% of the time.)

This means:

1. If a signal is delivered, the app wants PKRU to be set to some safe
value as early as possible. The app can do it by itself if it doesn't use
libraries that interfere, but it would be IMO much, much nicer if the
kernel made this happen automatically.

1b. If the app malfunctions such that RSP points to pmem, the kernel MUST
NOT clobber the pmem space. I think that this basically mandates that PKRU
needs to have some safe state (i.e. definitely not the init state) on
signal delivery: the kernel is going to write a signal frame at the
address identified by RSP, and that address is in pmem, so those writes
need to fail.

2. clone with shared VM should preserve PKRU. So in this regard, I think
signals are kind of like new threads after all.
Now that I think about it more, there are at least two possibilities that
work for this use case.

Option A: signal delivery preserves PKRU. If we go this route, I think we
should decide whether PKRU is preserved on exit. Given that we're talking
about adding syscalls to adjust PKRU, it seems a bit odd to me that
sigreturn would, by default, undo whatever those syscalls do.

Option B: signal delivery resets PKRU to user-specified values. We're
talking about having syscalls to write PKRU. We could have clone and
signal entry use the values set by syscall instead of the values in the
actual PKRU register. That way:

set_protection_key(1, no writes);
...
wrpkru(allow writes to key 1)
mov something                      <<< fault or async signal here
wrpkru(disallow writes to key 1)

leaves key 1 protected in the signal handler.

If we go that route, I think we must restore PKRU just like any other
xstate on sigreturn, so in some sense it's the simplest of all to
implement. We'll need to change the signal delivery stuff to do the
fpu__clear and the automatic PKRU
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
> Which reminds me: __get_user, etc all respect PKRU because the SDM
> says that PKRU affects all user *and* kernel accesses to user memory
> (i.e. anything with _PAGE_USER). Should get_user_pages also respect
> PKRU?

It does. There are a couple of parts to it, but the most important are in
these patches:

> http://git.kernel.org/cgit/linux/kernel/git/daveh/x86-pkeys.git/commit/?h=pkeys-v018&id=c64acbe036487d07968ea6a5b9090169d6e40160
> http://git.kernel.org/cgit/linux/kernel/git/daveh/x86-pkeys.git/commit/?h=pkeys-v018&id=331c62eb789f969e187964e28ac80b1e0e26b69d
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
> 1b. If the app malfunctions such that RSP points to pmem, the kernel
> MUST NOT clobber the pmem space. I think that this basically mandates
> that PKRU needs to have some safe state (i.e. definitely not the init
> state) on signal delivery: the kernel is going to write a signal frame
> at the address identified by RSP, and that address is in pmem, so
> those writes need to fail.

The kernel is writing the signal frame using normal old copy_to_user().
Those are writing through mappings with _PAGE_USER set and should be
subject to the PKRU state of the thread before the signal started to be
delivered.

We don't do the fpu__clear() until after this copy, so I think pkeys
enforcement is being done properly for this today.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen wrote:
>
> But, I think it's a small one. Basically, RSP would have to be pointing
> at a place which was allowed by protection keys for all of the sigframe
> setup.

Note that the whole "stack is special" is not at all a new issue. It's
the main reason why sigaltstack() and SS_ONSTACK exist. This is in no way
new to PKRU; people have had to handle the issue of stack-related SIGSEGV
faults for a long time.

So any application that uses PKRU and may play games that affect the
stack will always have to have a separate "safe stack" that it uses for
signal handling. But that is in no way PKRU-specific, it's been the case
for a lot of other memory management faults.

             Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen wrote:
> On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
>> 1b. If the app malfunctions such that RSP points to pmem, the kernel
>> MUST NOT clobber the pmem space. I think that this basically mandates
>> that PKRU needs to have some safe state (i.e. definitely not the init
>> state) on signal delivery: the kernel is going to write a signal frame
>> at the address identified by RSP, and that address is in pmem, so
>> those writes need to fail.
>
> The kernel is writing the signal frame using normal old copy_to_user().
> Those are writing through mappings with _PAGE_USER set and should be
> subject to the PKRU state of the thread before the signal started to be
> delivered.
>
> We don't do the fpu__clear() until after this copy, so I think pkeys
> enforcement is being done properly for this today.

True, but I think only in a very limited sense. Your average signal
handler is reasonably likely to execute "push %rbp" as its very first
instruction, at which point we're immediately screwed with the current
arrangement.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen wrote:
> On 12/18/2015 11:21 AM, Andy Lutomirski wrote:
>> On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen wrote:
>>> On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
>>>> 1b. If the app malfunctions such that RSP points to pmem, the kernel
>>>> MUST NOT clobber the pmem space. I think that this basically mandates
>>>> that PKRU needs to have some safe state (i.e. definitely not the init
>>>> state) on signal delivery: the kernel is going to write a signal frame
>>>> at the address identified by RSP, and that address is in pmem, so
>>>> those writes need to fail.
>>>
>>> The kernel is writing the signal frame using normal old copy_to_user().
>>> Those are writing through mappings with _PAGE_USER set and should be
>>> subject to the PKRU state of the thread before the signal started to be
>>> delivered.
>>>
>>> We don't do the fpu__clear() until after this copy, so I think pkeys
>>> enforcement is being done properly for this today.
>>
>> True, but I think only in a very limited sense. Your average signal
>> handler is reasonably likely to execute "push %rbp" as its very first
>> instruction, at which point we're immediately screwed with the current
>> arrangement.
>
> I completely agree that there's a window for corruption.
>
> But, I think it's a small one. Basically, RSP would have to be pointing
> at a place which was allowed by protection keys for all of the sigframe
> setup. Then, it _just_ happened to be at a place which was denied by
> protection keys when it enters the signal handler back in userspace.
> It's possible, but it's a small window.

That's true. OTOH, I think that the signal in the middle of the
wrpkru-allowed window is also a compelling case. So maybe my new
preference is option B, where we save and restore PKRU like all the other
xfeatures but, rather than clearing it in the signal context, we set it to
the syscall-specified value.

If we do this, then we'd have to clearly separate the syscall-specified
value from the xstate value, and we'd probably want one of the syscalls to
offer an option to force user PKRU to match the syscall value. Maybe your
code already does this -- I didn't check.

This has the nice property that sigreturn isn't affected at all except
insofar as we should give a little bit of thought to the ordering of
restoring PKRU relative to any other uaccess. I'd imagine that we'd just
restore PKRU last, in case the stack we're restoring from has access
disallowed by the saved PKRU value.

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 11:21 AM, Andy Lutomirski wrote:
> On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen wrote:
>> On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
>>> 1b. If the app malfunctions such that RSP points to pmem, the kernel
>>> MUST NOT clobber the pmem space. I think that this basically mandates
>>> that PKRU needs to have some safe state (i.e. definitely not the init
>>> state) on signal delivery: the kernel is going to write a signal frame
>>> at the address identified by RSP, and that address is in pmem, so
>>> those writes need to fail.
>>
>> The kernel is writing the signal frame using normal old copy_to_user().
>> Those are writing through mappings with _PAGE_USER set and should be
>> subject to the PKRU state of the thread before the signal started to be
>> delivered.
>>
>> We don't do the fpu__clear() until after this copy, so I think pkeys
>> enforcement is being done properly for this today.
>
> True, but I think only in a very limited sense. Your average signal
> handler is reasonably likely to execute "push %rbp" as its very first
> instruction, at which point we're immediately screwed with the current
> arrangement.

I completely agree that there's a window for corruption.

But, I think it's a small one. Basically, RSP would have to be pointing at
a place which was allowed by protection keys for all of the sigframe
setup. Then, it _just_ happened to be at a place which was denied by
protection keys when it enters the signal handler back in userspace. It's
possible, but it's a small window.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 3:08 PM, Linus Torvalds wrote:
> On Fri, Dec 18, 2015 at 2:28 PM, Andy Lutomirski wrote:
>>
>> Apps that don't want to use the baseline_pkru mechanism could use
>> syscalls to claim ownership of protection keys but then manage them
>> purely with WRPKRU directly. We could optionally disallow
>> mprotect_key on keys that weren't allocated in advance.
>>
>> Does that seem sane?
>
> So everything seems sane except for the need for that baseline_pkru.
>
> I'm not seeing why it couldn't just be a fixed value. Is there any
> real downside to it?

Yes, I think. If I'm using protection keys to protect some critical data
structure (important stuff in shared memory, important memory mapped
files, pmem, etc), then I'll allocate a protection key and set PKRU to
deny writes. The problem is that I really, really want writes denied
except when explicitly enabled in narrow regions of code that use wrpkru
to enable them, and I don't want an asynchronous signal delivered in those
narrow regions of code, or newly cloned threads, to pick up the
write-allow value. So I want baseline_pkru to have the deny-writes entry.
I think I would do exactly this in my production code here if my server
supported it. Some day...

Hrm. We might also want an option to change pkru and/or baseline_pkru in
all threads in the current mm. That's optional but it could be handy.
Maybe it would be as simple as having the allocate-a-pkey call have an
option to set an initial baseline value and an option to propagate that
initial value to pre-existing threads.

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 3:16 PM, Andy Lutomirski wrote:
>
> Yes, I think. If I'm using protection keys to protect some critical
> data structure (important stuff in shared memory, important memory
> mapped files, pmem, etc), then I'll allocate a protection key and set
> PKRU to deny writes. The problem is that I really, really want writes
> denied except when explicitly enabled in narrow regions of code that
> use wrpkru to enable them, and I don't want an asynchronous signal
> delivered in those narrow regions of code or newly cloned threads to
> pick up the write-allow value. So I want baseline_pkru to have the
> deny-writes entry.

Hmm. Ok, that does sound like a valid and interesting usage case. Fair
enough.

             Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 2:28 PM, Andy Lutomirski wrote:
>
> Apps that don't want to use the baseline_pkru mechanism could use
> syscalls to claim ownership of protection keys but then manage them
> purely with WRPKRU directly. We could optionally disallow
> mprotect_key on keys that weren't allocated in advance.
>
> Does that seem sane?

So everything seems sane except for the need for that baseline_pkru.

I'm not seeing why it couldn't just be a fixed value. Is there any real
downside to it?

             Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 12:49 PM, Andy Lutomirski wrote:
>
> IOW, I like my idea in which signal delivery always sets PKRU to the
> application-requested-by-syscall values and sigreturn restores it.
> Kinda like sigaltstack, but applies to all signals and affects PKRU
> instead of RSP.

I think this is the only sensible option, with the default being all zero.

	-hpa
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 1:45 PM, Linus Torvalds wrote:
> On Fri, Dec 18, 2015 at 1:12 PM, Dave Hansen wrote:
>>
>> But, if we are picking out an execute-only pkey more dynamically, we've
>> got to keep the default value for the entire process somewhere.
>
> How dynamic do we want to make this, though?
>
> I haven't looked at the details, and perhaps more importantly, I don't
> know what exactly are the requirements you've gotten from the people
> who are expected to actually use this.
>
> I think we might want to hardcode a couple of keys as "kernel
> reserved". And I'd rather reserve them up-front than have some user
> program be unhappy later when we want to use them.
>
> I guess we want to leave key #0 for "normal page", so my suggestion to
> use that for the execute-only was probably misguided.
>
> But I do think we might want to have that "no read access" as a real
> fixed key too, because I think the kernel itself would want to use it:
>
>  (a) to make sure that it gets the right fault when user space passes
> in an execute-only address to a system call.
>
>  (b) for much more efficient PAGEALLOC_DEBUG for kernel mappings.
>
> so I do think that we'd want to reserve two of the 16 keys up front.
>
> Would it be ok for the expected users to have those keys simply be
> fixed? With key 0 being used for all default pages, and key 1 being
> used for all execute-only pages? And then defaulting PKRU to 4,
> disallowing access to that key #1?
>
> I could imagine that some kernel person would want to use even more
> keys, but I think two fixed keys are kind of the minimal we'd want to
> use.

I imagine we'd reserve key 0 for normal pages and key 1 for deny-read.

Let me be a bit more concrete about what I'm suggesting:

We'd have thread_struct.baseline_pkru. It would start with key 0 allowing
all access and key 1 denying reads.

We'd have a syscall like set_protection_key that could allocate unused
keys and change the values of keys that have been allocated. Those changes
would be reflected in baseline_pkru. Changes to keys 0 and 1 in
baseline_pkru would not be allowed.

Signal delivery would load baseline_pkru into the PKRU register. Signal
restore would restore PKRU to its previous value.

WRPKRU would, of course, override baseline_pkru, but it wouldn't change
baseline_pkru. The set_protection_key syscall would modify *both* real
PKRU and baseline_pkru.

Apps that don't want to use the baseline_pkru mechanism could use syscalls
to claim ownership of protection keys but then manage them purely with
WRPKRU directly. We could optionally disallow mprotect_key on keys that
weren't allocated in advance.

Does that seem sane?

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:49 PM, Andy Lutomirski wrote:
>
> IOW, I like my idea in which signal delivery always sets PKRU to the
> application-requested-by-syscall values and sigreturn restores it.

So I don't mind that, as long as the whole "sigreturn restores it" is
part of things.

Your original email with the suggestion to *not* restore PKRU I didn't
like. Setting it and restoring it is fine.

I do wonder if you need an explicit value, though. I think it's
reasonable to say that PKRU value 0 is special. It's what we'd start
processes with, and why not just say that it's what we run signal
handlers in?

Would any other value ever make sense, really?

             Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 01:04 PM, Linus Torvalds wrote:
> On Fri, Dec 18, 2015 at 12:49 PM, Andy Lutomirski wrote:
>>
>> IOW, I like my idea in which signal delivery always sets PKRU to the
>> application-requested-by-syscall values and sigreturn restores it.
>
> So I don't mind that, as long as the whole "sigreturn restores it" is
> part of things.
>
> Your original email with the suggestion to *not* restore PKRU I didn't
> like. Setting it and restoring it is fine.
>
> I do wonder if you need an explicit value, though. I think it's
> reasonable to say that PKRU value 0 is special. It's what we'd start
> processes with, and why not just say that it's what we run signal
> handlers in?
>
> Would any other value ever make sense, really?

Having a PKRU with the execute-only permissions set is the only one I can
think of.

For a system with a _dedicated_ pkey for execute-only, this is easy and
could even be made a part of init_fpstate with no other code changes.

But, if we are picking out an execute-only pkey more dynamically, we've
got to keep the default value for the entire process somewhere.