Re: Real-Time Preemption and RCU
On Wed, Mar 23, 2005 at 12:44:52PM +0100, Esben Nielsen wrote:
> On Tue, 22 Mar 2005, Paul E. McKenney wrote:
> > On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
> > > On Mon, 21 Mar 2005, Paul E. McKenney wrote:
> > [ . . . ]
> > > > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > > > This is in some ways similar to the K42 approach to RCU (which they
> > > > call "generations").  Dipankar put together a similar patch for
> > > > Linux, but the problem was that grace periods could be deferred for
> > > > an extremely long time.  Which I suspect is what you were calling
> > > > out as causing RCU batches never to run.
> > >
> > > That is where the preempt_by_nonrt_disable/enable() is supposed to
> > > help: then it can't take longer than the normal kernel in the
> > > situation where there are no RT tasks running.  RT tasks will
> > > prolong the grace periods if they go into RCU regions, but those are
> > > supposed to be relatively small - and deterministic!
> >
> > The part that I am missing is how this helps in the case where a
> > non-RT task gets preempted in the middle of an RCU read-side critical
> > section indefinitely.  Or are you boosting the priority of any task
> > that enters an RCU read-side critical section?
>
> Yes, in effect: I set the priority to MAX_RT_PRIO.  But actually I am
> playing around (when I get time for it, that is :-( ) with a cheaper
> solution: I assume the regions where you don't want to be preempted by
> non-RT tasks are relatively short, so the risk of getting preempted is
> small.  Moving the priority is expensive, since you need to lock the
> runqueue, so I only want to do the move when there actually is a
> preemption.  Therefore I added code in schedule() to take care of it:
> if a task is in an RCU read-side section, is non-RT, and is preempted,
> its priority is set to MAX_RT_PRIO for the time being.  It will keep
> that priority until the priority is recalculated, but that shouldn't
> hurt anyone.
>
> I am not happy about adding code to schedule(), but setting the
> priority in there is very cheap because it already holds the lock on
> the runqueue.  Furthermore, I assume it only happens very rarely.  In
> the execution of schedule() my code only adds a single test on whether
> the previous task was in an RCU section or not.  That is not very much
> code.

Interesting approach -- could come in handy.

> I have not yet tested it (no time :-( )

Well, being as I haven't got the lock-based scheme fully running yet,
I can't give you too much trouble about that.  :-/

						Thanx, Paul

> > [...]
> > > > Yes, but this is true of every other lock in the system as well, not?
> > >
> > > Other locks are not globally used, but only used for a specific
> > > subsystem.  On a real-time system you are supposed to know which
> > > subsystems you can call into and still have a low enough latency, as
> > > each subsystem has its own bound.  But with a global RCU locking
> > > mechanism, all RCU-using code is to be regarded as _one_ such
> > > subsystem.
> >
> > Yep.  As would the things protected by the dcache lock, task list
> > lock, and so on, right?
>
> Yep

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: Real-Time Preemption and RCU
On Tue, 22 Mar 2005, Paul E. McKenney wrote:
> On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
> > On Mon, 21 Mar 2005, Paul E. McKenney wrote:
> [ . . . ]
> > > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > > This is in some ways similar to the K42 approach to RCU (which they
> > > call "generations").  Dipankar put together a similar patch for
> > > Linux, but the problem was that grace periods could be deferred for
> > > an extremely long time.  Which I suspect is what you were calling
> > > out as causing RCU batches never to run.
> >
> > That is where the preempt_by_nonrt_disable/enable() is supposed to
> > help: then it can't take longer than the normal kernel in the
> > situation where there are no RT tasks running.  RT tasks will prolong
> > the grace periods if they go into RCU regions, but those are supposed
> > to be relatively small - and deterministic!
>
> The part that I am missing is how this helps in the case where a non-RT
> task gets preempted in the middle of an RCU read-side critical section
> indefinitely.  Or are you boosting the priority of any task that
> enters an RCU read-side critical section?

Yes, in effect: I set the priority to MAX_RT_PRIO.  But actually I am
playing around (when I get time for it, that is :-( ) with a cheaper
solution: I assume the regions where you don't want to be preempted by
non-RT tasks are relatively short, so the risk of getting preempted is
small.  Moving the priority is expensive, since you need to lock the
runqueue, so I only want to do the move when there actually is a
preemption.  Therefore I added code in schedule() to take care of it:
if a task is in an RCU read-side section, is non-RT, and is preempted,
its priority is set to MAX_RT_PRIO for the time being.  It will keep
that priority until the priority is recalculated, but that shouldn't
hurt anyone.

I am not happy about adding code to schedule(), but setting the priority
in there is very cheap because it already holds the lock on the
runqueue.  Furthermore, I assume it only happens very rarely.  In the
execution of schedule() my code only adds a single test on whether the
previous task was in an RCU section or not.  That is not very much code.

I have not yet tested it (no time :-( )

> [...]
> > > Yes, but this is true of every other lock in the system as well, not?
> >
> > Other locks are not globally used, but only used for a specific
> > subsystem.  On a real-time system you are supposed to know which
> > subsystems you can call into and still have a low enough latency, as
> > each subsystem has its own bound.  But with a global RCU locking
> > mechanism, all RCU-using code is to be regarded as _one_ such
> > subsystem.
>
> Yep.  As would the things protected by the dcache lock, task list lock,
> and so on, right?

Yep

> 						Thanx, Paul

Esben
Re: Real-Time Preemption and RCU
On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
> On Mon, 21 Mar 2005, Paul E. McKenney wrote:
[ . . . ]
> > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > This is in some ways similar to the K42 approach to RCU (which they
> > call "generations").  Dipankar put together a similar patch for
> > Linux, but the problem was that grace periods could be deferred for
> > an extremely long time.  Which I suspect is what you were calling out
> > as causing RCU batches never to run.
>
> That is where the preempt_by_nonrt_disable/enable() is supposed to
> help: then it can't take longer than the normal kernel in the situation
> where there are no RT tasks running.  RT tasks will prolong the grace
> periods if they go into RCU regions, but those are supposed to be
> relatively small - and deterministic!

The part that I am missing is how this helps in the case where a non-RT
task gets preempted in the middle of an RCU read-side critical section
indefinitely.  Or are you boosting the priority of any task that
enters an RCU read-side critical section?

> > > > The counter approach might work, and is also what the
> > > > implementation #5 does -- check out rcu_read_lock() in Ingo's
> > > > most recent patch.
> > >
> > > Do you refer to your original mail with implementing it in 5 steps?
> > > In #5 in that one (-V0.7.41-00, right?) you use a lock and, as you
> > > say, that forces synchronization between the CPUs - bad for scaling.
> > > It does make the RCU batches somewhat deterministic, as the RCU task
> > > can boost the readers to the rcu-task's priority.
> > > The problem with this approach is that everybody calling into RCU
> > > code has a worst-case behaviour of the system-wide worst-case RCU
> > > reader section - which can be pretty large (in principle infinite if
> > > somebody.)  So if somebody uses a call to a function in code
> > > containing an RCU read area, the worst-case behaviour would be the
> > > same as the worst-case latency in the simple world where
> > > preempt_disable()/preempt_enable() was used.
> >
> > I missed something here -- readers would see the worst-case -writer-
> > latency rather than the worst-case -reader- latency, right?  Or are
> > you concerned about the case where some reader blocks the write-side
> > acquisitions in _synchronize_kernel(), but not before the writer has
> > grabbed a lock, blocking any readers on the corresponding CPU?
>
> I am concerned that readers block each other, too.  You do need an
> rw-mutex allowing an unlimited number of readers for doing this.  With
> the current rw-mutex the readers block each other, i.e. the worst-case
> latency is the worst-case reader latency - globally!
> On the other hand, with an rw-lock being unlimited - and thus not
> keeping track of its readers - the readers can't be boosted by the
> writer.  Then you are back to square 1: the grace period can take a
> very long time.

OK, good point.

> > Yes, but this is true of every other lock in the system as well, not?
>
> Other locks are not globally used, but only used for a specific
> subsystem.  On a real-time system you are supposed to know which
> subsystems you can call into and still have a low enough latency, as
> each subsystem has its own bound.  But with a global RCU locking
> mechanism, all RCU-using code is to be regarded as _one_ such
> subsystem.

Yep.  As would the things protected by the dcache lock, task list lock,
and so on, right?

						Thanx, Paul

> > In a separate conversation a few weeks ago, Steve Hemminger suggested
> > placing checks in long RCU read-side critical sections -- this could
> > be used to keep the worst-case reader latency within the desired
> > bounds.  Not pretty, but, then again, bounding lock latency will
> > require a fair number of similar changes, I suspect.
> >
> > 					Thanx, Paul
>
> Esben
Re: Real-Time Preemption and RCU
On Tue, 22 Mar 2005, Ingo Molnar wrote:
>
> * Esben Nielsen <[EMAIL PROTECTED]> wrote:
>
> > > > +static inline void rcu_read_lock(void)
> > > > +{
> > > > +	preempt_disable();
> > > > +	__get_cpu_var(rcu_data).active_readers++;
> > > > +	preempt_enable();
> > > > +}
> > >
> > > this is buggy.  Nothing guarantees that we'll do the
> > > rcu_read_unlock() on the same CPU, and hence ->active_readers can
> > > get out of sync.
> >
> > Ok, this has to be handled in the migration code somehow.  I have
> > already added a
> >   current->rcu_read_depth++
> > so it ought to be painless.  A simple solution would be not to
> > migrate threads with rcu_read_depth != 0.
>
> could you elaborate?

Put an rcu_read_depth on each task.  In rcu_read_lock() do a

	current->rcu_read_depth++;

and vice versa in rcu_read_unlock().  In can_migrate_task() add

	if (p->rcu_read_depth)
		return 0;

That might do the trick.

> In any case, see the new PREEMPT_RCU code in the -40-07 patch (and
> upwards).  I've also attached a separate patch, it should apply cleanly
> to 2.6.12-rc1.

I barely have time to download the patches - let alone apply them!
Anyway, I find one thing I don't like: using atomic_inc()/dec() in
rcu_read_lock()/unlock() to touch rcu_data which might be on another
CPU.  Then rcu_data is not really per-CPU data anymore, and it also
hurts the performance of RCU readers.  I think it will be cheaper to use
the above rcu_read_depth and then either not migrate such tasks at all,
or make the migration code take care of moving the rcu_read_depth count
to the new CPU - one would have to take care to increment it in the
rcu_data of the new CPU on the new CPU (it isn't atomic) and then
decrement it in the rcu_data of the old CPU on the old CPU - in that
order.

> 	Ingo

Esben
Re: Real-Time Preemption and RCU
* Esben Nielsen <[EMAIL PROTECTED]> wrote:

> > > +static inline void rcu_read_lock(void)
> > > +{
> > > +	preempt_disable();
> > > +	__get_cpu_var(rcu_data).active_readers++;
> > > +	preempt_enable();
> > > +}
> >
> > this is buggy.  Nothing guarantees that we'll do the rcu_read_unlock()
> > on the same CPU, and hence ->active_readers can get out of sync.
>
> Ok, this has to be handled in the migration code somehow.  I have
> already added a
>   current->rcu_read_depth++
> so it ought to be painless.  A simple solution would be not to migrate
> threads with rcu_read_depth != 0.

could you elaborate?

In any case, see the new PREEMPT_RCU code in the -40-07 patch (and
upwards).  I've also attached a separate patch, it should apply cleanly
to 2.6.12-rc1.

	Ingo

--- linux/kernel/rcupdate.c.orig
+++ linux/kernel/rcupdate.c
@@ -468,3 +468,36 @@ module_param(maxbatch, int, 0);
 EXPORT_SYMBOL(call_rcu);
 EXPORT_SYMBOL(call_rcu_bh);
 EXPORT_SYMBOL(synchronize_kernel);
+
+#ifdef CONFIG_PREEMPT_RCU
+
+void rcu_read_lock(void)
+{
+	if (current->rcu_read_lock_nesting++ == 0) {
+		current->rcu_data = &get_cpu_var(rcu_data);
+		atomic_inc(&current->rcu_data->active_readers);
+		put_cpu_var(rcu_data);
+	}
+}
+EXPORT_SYMBOL(rcu_read_lock);
+
+void rcu_read_unlock(void)
+{
+	int cpu;
+
+	if (--current->rcu_read_lock_nesting == 0) {
+		atomic_dec(&current->rcu_data->active_readers);
+		/*
+		 * Check whether we have reached quiescent state.
+		 * Note! This is only for the local CPU, not for
+		 * current->rcu_data's CPU [which typically is the
+		 * current CPU, but may also be another CPU].
+		 */
+		cpu = get_cpu();
+		rcu_qsctr_inc(cpu);
+		put_cpu();
+	}
+}
+EXPORT_SYMBOL(rcu_read_unlock);
+
+#endif
--- linux/include/linux/rcupdate.h.orig
+++ linux/include/linux/rcupdate.h
@@ -99,6 +99,9 @@ struct rcu_data {
 	struct rcu_head *donelist;
 	struct rcu_head **donetail;
 	int cpu;
+#ifdef CONFIG_PREEMPT_RCU
+	atomic_t	active_readers;
+#endif
 };

 DECLARE_PER_CPU(struct rcu_data, rcu_data);
@@ -115,12 +118,18 @@ extern struct rcu_ctrlblk rcu_bh_ctrlblk
 static inline void rcu_qsctr_inc(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
+#ifdef CONFIG_PREEMPT_RCU
+	if (!atomic_read(&rdp->active_readers))
+#endif
+		rdp->passed_quiesc = 1;
 }

 static inline void rcu_bh_qsctr_inc(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
+#ifdef CONFIG_PREEMPT_RCU
+	if (!atomic_read(&rdp->active_readers))
+#endif
+		rdp->passed_quiesc = 1;
 }

 static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
@@ -183,14 +192,22 @@ static inline int rcu_pending(int cpu)
  *
  * It is illegal to block while in an RCU read-side critical section.
  */
-#define rcu_read_lock()		preempt_disable()
+#ifdef CONFIG_PREEMPT_RCU
+  extern void rcu_read_lock(void);
+#else
+# define rcu_read_lock()	preempt_disable()
+#endif

 /**
  * rcu_read_unlock - marks the end of an RCU read-side critical section.
  *
  * See rcu_read_lock() for more information.
  */
-#define rcu_read_unlock()	preempt_enable()
+#ifdef CONFIG_PREEMPT_RCU
+  extern void rcu_read_unlock(void);
+#else
+# define rcu_read_unlock()	preempt_enable()
+#endif

 /*
  * So where is rcu_write_lock()?  It does not exist, as there is no
@@ -213,14 +230,16 @@ static inline int rcu_pending(int cpu)
  * can use just rcu_read_lock().
  *
  */
-#define rcu_read_lock_bh()	local_bh_disable()
+//#define rcu_read_lock_bh()	local_bh_disable()
+#define rcu_read_lock_bh()	{ rcu_read_lock(); local_bh_disable(); }

 /*
  * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
  *
  * See rcu_read_lock_bh() for more information.
  */
-#define rcu_read_unlock_bh()	local_bh_enable()
+//#define rcu_read_unlock_bh()	local_bh_enable()
+#define rcu_read_unlock_bh()	{ local_bh_enable(); rcu_read_unlock(); }

 /**
  * rcu_dereference - fetch an RCU-protected pointer in an
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -727,6 +872,10 @@ struct task_struct {
 	nodemask_t mems_allowed;
 	int cpuset_mems_generation;
 #endif
+#ifdef CONFIG_PREEMPT_RCU
+	int rcu_read_lock_nesting;
+	struct rcu_data *rcu_data;
+#endif
 };

 static inline pid_t process_group(struct task_struct *tsk)
Re: Real-Time Preemption and RCU
On Tue, 22 Mar 2005, Ingo Molnar wrote:
>
> * Esben Nielsen <[EMAIL PROTECTED]> wrote:
>
> > +static inline void rcu_read_lock(void)
> > +{
> > +	preempt_disable();
> > +	__get_cpu_var(rcu_data).active_readers++;
> > +	preempt_enable();
> > +}
>
> this is buggy.  Nothing guarantees that we'll do the rcu_read_unlock()
> on the same CPU, and hence ->active_readers can get out of sync.

Ok, this has to be handled in the migration code somehow.  I have
already added a

	current->rcu_read_depth++

so it ought to be painless.  A simple solution would be not to migrate
threads with rcu_read_depth != 0.

> 	Ingo

Esben
Re: Real-Time Preemption and RCU
* Esben Nielsen <[EMAIL PROTECTED]> wrote:

> +static inline void rcu_read_lock(void)
> +{
> +	preempt_disable();
> +	__get_cpu_var(rcu_data).active_readers++;
> +	preempt_enable();
> +}

this is buggy.  Nothing guarantees that we'll do the rcu_read_unlock()
on the same CPU, and hence ->active_readers can get out of sync.

	Ingo
Re: Real-Time Preemption and RCU
On Tue, 22 Mar 2005, Bill Huey wrote:
> On Fri, Mar 18, 2005 at 05:55:44PM +0100, Esben Nielsen wrote:
> > On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > > i really have no intention to allow multiple readers for rt-mutexes.
> > > We got away with that so far, and i'd like to keep it so.  Imagine
> > > 100 threads all blocked in the same critical section (holding the
> > > read-lock) when a highprio writer thread comes around: instant 100x
> > > latency to let all of them roll forward.  The only sane solution is
> > > to not allow excessive concurrency.  (That limits SMP scalability,
> > > but there's no other choice i can see.)
> >
> > Unless a design change is made: One could argue for a semantics where
> > write-locking _isn't_ deterministic and thus does not have to boost
> > all the
>
> RCU isn't write deterministic like typical RT apps are we can...
> (below :-))

It is: it takes place right away.  But what is non-deterministic is when
_all_ readers have actually read it.  Also the cleanup is
non-deterministic.  So unless you actually _wait_ for the cleanup to
happen instead of deferring it, you can safely do RCU writes in an
RT task.

> > readers.  Readers boost the writers but not the other way around.
> > Readers will be deterministic, but not writers.
> > Such a semantics would probably work for a lot of RT applications
> > happening not to take any write-locks - these will in fact perform
> > better.  But it will give the rest a lot of problems.
>
> Just came up with an idea after I thought about how much of a bitch it
> would be to get a fast RCU multipule reader semantic (our current
> shared-exclusive lock inserts owners into a sorted priority list
> per-thread which makes it very expensive for a simple RCU case since
> they are typically very small batches of items being altered).
> Basically the RCU algorithm has *no* notion of writer priority and to
> propagate a PI operation down all reader is meaningless, so why not
> revert back to the original rwlock-semaphore to get the multipule
> reader semantics ?

Remember to boost the writer, so that RT tasks can enter read regions.
I must also warn against the dangers: a lot of code where a write-lock
is taken needs to be marked as non-deterministic, i.e. must not be
called from RT tasks (maybe put a WARN_ON(rt_task(current)) in the
write-lock operation?)

> A notion of priority across a quiescience operation is crazy anyways,
> so it would be safe just to use to the old rwlock-semaphore "in place"
> without any changes or priorty handling addtions.  The RCU algorithm is
> only concerned with what is basically a coarse data guard and it isn't
> time or priority critical.

I don't find it crazy.  I think it is elegant - but also dangerous, as
it might take a long time.

> What do you folks think ?  That would make Paul's stuff respect
> multipule readers which reduces contention and gets around the problem
> of possibly overloading the current rt lock implementation that we've
> been bitching about.  The current RCU development track seem wrong in
> the first place and this seem like it could be a better and more
> complete solution to the problem.
>
> If this works, well, you heard it here first. :)
>
> bill

Esben
Re: Real-Time Preemption and RCU
On Tue, Mar 22, 2005 at 02:17:27AM -0800, Bill Huey wrote:
> > A notion of priority across a quiescience operation is crazy
> > anyways[-,-] so it would be safe just to use to the old
> > rwlock-semaphore "in place" without any changes or priorty handling
> > add[i]tions.  The RCU algorithm is only concerned with what is
> > basically a coarse data guard and it isn't time or priority critical.
>
> A little jitter in a quiescence operation isn't going to hurt things,
> right?

The only thing that I can think of that can go wrong here is what kind
of effect it would have on a thread write-blocking against a bunch of
RCU readers.  It could introduce a chain of delays into, say, a timer
event and might cause problems/side-effects for other things being
processed.  RCU processing might have to be decoupled and processed by a
different thread to avoid some of that latency weirdness.

What do you folks think?

bill
Re: Real-Time Preemption and RCU
On Tue, 22 Mar 2005, Ingo Molnar wrote:
>
> * Esben Nielsen <[EMAIL PROTECTED]> wrote:
>
> > On the other hand, with an rw-lock being unlimited - and thus not
> > keeping track of its readers - the readers can't be boosted by the
> > writer.  Then you are back to square 1: the grace period can take a
> > very long time.
>
> btw., is the 'very long grace period' a real issue?  We could avoid all
> the RCU read-side locking latencies by making it truly unlocked and
> just living with the long grace periods.  Perhaps it's enough to add an
> emergency mechanism to the OOM handler, which frees up all the 'blocked
> by preemption' RCU callbacks via some scheduler magic.  (e.g. such an
> emergency mechanism could be _conditional_ locking on the read side -
> i.e. new RCU read-side users would be blocked until the OOM situation
> goes away, or something like that.)

You won't catch RCU read sides already entered - see below.

> your patch is implementing just that, correct?  Would you mind redoing
> it against a recent -RT base?  (-40-04 or so)

In fact I am working on clean 2.6.12-rc1 right now.  I figured this is
orthogonal to the rest of the RT patch and can probably go upstream
immediately.  It seemed to work.  I'll try to make it into a clean patch
soonish and also try it on -40-04.  I am trying to make a boosting
mechanism in the scheduler such that RCU readers are boosted to
MAX_RT_PRIO when preempted; I have to take that out first.  Any specific
tests I have to run?  I am considering making an RCU test device.

> also, what would be the worst-case workload causing long grace periods?

A nice +19 task, A, enters an RCU region and is preempted.  A lot of
other tasks start running.  Then task A might be starved for _minutes_,
so that there are no RCU grace periods in all that time.

> 	Ingo

Esben
Re: Real-Time Preemption and RCU
On Tue, Mar 22, 2005 at 02:04:46AM -0800, Bill Huey wrote:
> RCU isn't write deterministic like typical RT apps are[, so] we can...
> (below :-))
...
> Just came up with an idea after I thought about how much of a bitch it
> would be to get a fast RCU multipule reader semantic (our current
> shared-exclusive lock inserts owners into a sorted priority list
> per-thread which makes it very expensive for a simple RCU case[,]
> since they are typically very small batches of items being altered).
> Basically the RCU algorithm has *no* notion of writer priority and to
> propagate a PI operation down all reader[s] is meaningless, so why not
> revert back to the original rwlock-semaphore to get the multipule
> reader semantics ?

The original lock, for those that don't know, doesn't strictly track
read owners, so reentrancy is cheap.

> A notion of priority across a quiescience operation is crazy
> anyways[-,-] so it would be safe just to use to the old
> rwlock-semaphore "in place" without any changes or priorty handling
> add[i]tions.  The RCU algorithm is only concerned with what is
> basically a coarse data guard and it isn't time or priority critical.

A little jitter in a quiescence operation isn't going to hurt things,
right?

> What do you folks think ?  That would make Paul's stuff respect
> multipule readers which reduces contention and gets around the problem
> of possibly overloading the current rt lock implementation that we've
> been bitching about.  The current RCU development track seem wrong in
> the first place and this seem like it could be a better and more
> complete solution to the problem.

Got to get rid of those typos :)

bill
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 05:55:44PM +0100, Esben Nielsen wrote: > On Fri, 18 Mar 2005, Ingo Molnar wrote: > > i really have no intention to allow multiple readers for rt-mutexes. We > > got away with that so far, and i'd like to keep it so. Imagine 100 > > threads all blocked in the same critical section (holding the read-lock) > > when a highprio writer thread comes around: instant 100x latency to let > > all of them roll forward. The only sane solution is to not allow > > excessive concurrency. (That limits SMP scalability, but there's no > > other choice i can see.) > > Unless a design change is made: One could argue for a semantics where > write-locking _isn't_ deterministic and thus does not have to boost all the RCU isn't write deterministic like typical RT apps are, so we can... (below :-)) > readers. Readers boost the writers but not the other way around. Readers > will be deterministic, but not writers. > Such a semantics would probably work for a lot of RT applications > happening not to take any write-locks - these will in fact perform better. > But it will give the rest a lot of problems. Just came up with an idea after I thought about how much of a bitch it would be to get a fast RCU multiple reader semantic (our current shared-exclusive lock inserts owners into a sorted priority list per-thread which makes it very expensive for a simple RCU case, since they are typically very small batches of items being altered). Basically the RCU algorithm has *no* notion of writer priority and to propagate a PI operation down all readers is meaningless, so why not revert back to the original rwlock-semaphore to get the multiple reader semantics ? A notion of priority across a quiescence operation is crazy anyways, so it would be safe just to use the old rwlock-semaphore "in place" without any changes or priority handling additions. The RCU algorithm is only concerned with what is basically a coarse data guard and it isn't time or priority critical.
What do you folks think ? That would make Paul's stuff respect multiple readers, which reduces contention and gets around the problem of possibly overloading the current rt lock implementation that we've been bitching about. The current RCU development track seems wrong in the first place and this seems like it could be a better and more complete solution to the problem. If this works, well, you heard it here first. :) bill
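Bill's point about the original rwlock-semaphore - that it counts readers instead of tracking individual owners, so read-side reentrancy is cheap - can be sketched in userspace. The following is only an illustrative pthread model (the `crwlock` name and layout are invented, this is not the -RT or mainline implementation); note how a blocked writer has nothing but an anonymous count to look at, which is exactly why priority inheritance can't propagate to the readers:

```c
#include <pthread.h>

/* Counting reader-writer lock: readers are anonymous (no owner list),
 * so read_lock() is just a counter bump and is trivially reentrant -
 * but with no owner tracking, a blocked writer has nobody to boost. */
struct crwlock {
	pthread_mutex_t mtx;
	pthread_cond_t  cond;
	int             readers;   /* >0: that many readers hold the lock */
	int             writer;    /* 1: a writer holds the lock          */
};

static void crw_init(struct crwlock *l)
{
	pthread_mutex_init(&l->mtx, NULL);
	pthread_cond_init(&l->cond, NULL);
	l->readers = 0;
	l->writer = 0;
}

static void crw_read_lock(struct crwlock *l)
{
	pthread_mutex_lock(&l->mtx);
	while (l->writer)                  /* wait out any writer */
		pthread_cond_wait(&l->cond, &l->mtx);
	l->readers++;                      /* anonymous: just a count */
	pthread_mutex_unlock(&l->mtx);
}

static void crw_read_unlock(struct crwlock *l)
{
	pthread_mutex_lock(&l->mtx);
	if (--l->readers == 0)             /* last reader out wakes writers */
		pthread_cond_broadcast(&l->cond);
	pthread_mutex_unlock(&l->mtx);
}

static void crw_write_lock(struct crwlock *l)
{
	pthread_mutex_lock(&l->mtx);
	while (l->writer || l->readers)    /* wait for *all* readers */
		pthread_cond_wait(&l->cond, &l->mtx);
	l->writer = 1;
	pthread_mutex_unlock(&l->mtx);
}

static void crw_write_unlock(struct crwlock *l)
{
	pthread_mutex_lock(&l->mtx);
	l->writer = 0;
	pthread_cond_broadcast(&l->cond);
	pthread_mutex_unlock(&l->mtx);
}
```

Since a waiting writer never blocks new readers here, writers can starve - which is the "writers are non-deterministic, readers are deterministic" trade-off being debated in the thread.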
Re: Real-Time Preemption and RCU
* Esben Nielsen <[EMAIL PROTECTED]> wrote: > On the other hand with a rw-lock being unlimited - and thus do not > keep track of it readers - the readers can't be boosted by the writer. > Then you are back to square 1: The grace period can take a very long > time. btw., is the 'very long grace period' a real issue? We could avoid all the RCU read-side locking latencies by making it truly unlocked and just living with the long grace periods. Perhaps it's enough to add an emergency mechanism to the OOM handler, which frees up all the 'blocked by preemption' RCU callbacks via some scheduler magic. (e.g. such an emergency mechanism could be _conditional_ locking on the read side - i.e. new RCU read-side users would be blocked until the OOM situation goes away, or something like that.) your patch is implementing just that, correct? Would you mind redoing it against a recent -RT base? (-40-04 or so) also, what would be the worst-case workload causing long grace periods? Ingo
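Ingo's "_conditional_ locking on the read side" idea could look roughly like the following userspace model (purely a sketch with invented names - `oom_emergency`, `emergency_rcu_read_lock` - using C11 atomics in place of the kernel's primitives): the OOM handler raises a flag, new readers are held off until it clears, and the forced grace period completes once the already-entered readers drain out.

```c
#include <stdatomic.h>

static atomic_int oom_emergency;   /* set by the OOM handler           */
static atomic_int active_readers;  /* readers currently inside a read- */
                                   /* side critical section            */

static void emergency_rcu_read_lock(void)
{
	for (;;) {
		/* new readers are blocked until the OOM situation goes away */
		while (atomic_load(&oom_emergency))
			;
		atomic_fetch_add(&active_readers, 1);
		if (!atomic_load(&oom_emergency))
			return;                       /* entered cleanly */
		/* raced with the flag being raised: back out and retry */
		atomic_fetch_sub(&active_readers, 1);
	}
}

static void emergency_rcu_read_unlock(void)
{
	atomic_fetch_sub(&active_readers, 1);
}

/* The OOM handler would set oom_emergency, wait for this to become
 * true (a forced grace period), run the deferred callbacks, and then
 * clear the flag again. */
static int grace_period_reached(void)
{
	return atomic_load(&active_readers) == 0;
}
```

As Esben notes in his reply, this only gates *new* readers - a reader that already entered before the flag was raised still has to be waited out.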
Re: Real-Time Preemption and RCU
On Mon, 21 Mar 2005, Paul E. McKenney wrote: > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote: > > > [...] > > Well, I was actually thinking of an API like > > preempt_by_nonrt_disable() > > preempt_by_nonrt_enable() > > working like the old preempt_disable()/preempt_enable() but still > > allowing RT tasks (as well as priority inheriting non-RT tasks) to be > > scheduled. That would kind of help "split" the kernel into two halves: the > > RT part and the non-RT part. The non-RT part would in many ways work as it > > has always done. > > Does sound in some ways similar to the migration approach -- there, the > RT/non-RT split is made across CPUs. But if RT is allowed to preempt, > then you still have to deal with preemption for locking correctness, right? > Yes. It is not a locking mechanism, it should just prevent scheduling of "normal" userspace tasks. > > [..] > > Well, in my patch I do not leave preemption off - only while doing the > > simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader > > might be active, i.e. preempted, on the current CPU such that this isn't > > a quiescent point after all. > > (To others: Paul nicely unfolded my attachment below - I left it in > > the mail so you can read it.) > > The problem with this approach is of course that user space programs might > > preempt an RCU reader for a very long time such that RCU batches are never > > really run. The boosting of non-RT tasks mentioned above would help a > > lot. > > A plus(?) in it: You can actually sleep while having the rcu_read_lock !! > > This is in some ways similar to the K42 approach to RCU (which they call > "generations"). Dipankar put together a similar patch for Linux, but > the problem was that grace periods could be deferred for an extremely > long time. Which I suspect is what you were calling out as causing > RCU batches never to run.
> That is where the preempt_by_nonrt_disable/enable() is supposed to help: Then it can't take longer than the normal kernel in the situation where there are no RT tasks running. RT tasks will prolong the grace periods if they go into RCU regions, but they are supposed to be relatively small - and deterministic! > > > The counter approach might work, and is also what the implementation #5 > > > does -- check out rcu_read_lock() in Ingo's most recent patch. > > > > > > > Do you refer to your original mail with implementing it in 5 steps? > > In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that > > forces synchronization between the CPUs - bad for scaling. It does make the > > RCU batches somewhat deterministic, as the RCU task can boost the readers > > to the rcu-task's priority. > > The problem with this approach is that everybody calling into RCU code > > has a worst case behaviour of the systemwide worst case RCU reader > > section - which can be pretty large (in principle infinite if somebody.) > > So if somebody uses a call to a function in the code containing an RCU read > > area the worst case behaviour would be the same as the worst case latency > > in the simple world where preempt_disable()/preempt_enable() was used. > > I missed something here -- readers would see the worst-case -writer- > latency rather than the worst-case -reader- latency, right? Or are you > concerned about the case where some reader blocks the write-side > acquisitions in _synchronize_kernel(), but not before the writer has > grabbed a lock, blocking any readers on the corresponding CPU? > I am concerned that readers block each other, too. You do need an rw-mutex allowing an unlimited number of readers for doing this. With the current rw-mutex the readers block each other. I.e. the worst case latency is the worst case reader latency - globally!
On the other hand with a rw-lock being unlimited - and thus not keeping track of its readers - the readers can't be boosted by the writer. Then you are back to square 1: The grace period can take a very long time. > Yes, but this is true of every other lock in the system as well, not? Other locks are not globally used but only used for a specific subsystem. On a real-time system you are supposed to know which subsystems you can call into and still have a low enough latency as each subsystem has its own bound. But with a global RCU locking mechanism all RCU-using code is to be regarded as _one_ such subsystem. > In a separate conversation a few weeks ago, Steve Hemminger suggested > placing checks in long RCU read-side critical sections -- this could > be used to keep the worst-case reader latency within the desired bounds. > Not pretty, but, then again, bounding lock latency will require a fair > number of similar changes, I suspect. > > Thanx, Paul > Esben
Re: Real-Time Preemption and RCU
* Kyle Moffett <[EMAIL PROTECTED]> wrote: > One solution I can think of, although it bloats memory usage for > many-way boxen, is to just have a table in the rwlock with one entry > per cpu. Each CPU would get one concurrent reader, others would need > to sleep yes, it bloats memory usage, and complicates PI quite significantly. We've been there and have done something similar to that in earlier -RT patches - 64 'owner' entries in the lock already caused problems on 8K stacks. 32 seemed to work but caused quite some bloat in data structure sizes. i'd rather live with some scalability bottleneck for now, in exchange for a vastly easier to handle design. We can still complicate things later on for better scalability, once all the other issues have been sorted out. Ingo
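Kyle's per-CPU reader-slot idea has roughly this shape (a single-threaded sketch with invented names, no atomics or sleeping shown - not the -RT code): a reader claims its own CPU's slot, a writer must find every slot free. Both of Ingo's objections are visible directly: the lock grows with `NR_CPUS`, and PI would have to consider up to `NR_CPUS` distinct owners.

```c
#define NR_CPUS 4

/* One reader slot per CPU: this is where the size bloat comes from. */
struct percpu_rwlock {
	int reader[NR_CPUS];   /* 1: that CPU's slot is taken by a reader */
};

static int pr_read_trylock(struct percpu_rwlock *l, int cpu)
{
	if (l->reader[cpu])
		return 0;          /* this CPU's slot is busy: caller sleeps */
	l->reader[cpu] = 1;    /* "each CPU would get one concurrent reader" */
	return 1;
}

static void pr_read_unlock(struct percpu_rwlock *l, int cpu)
{
	l->reader[cpu] = 0;
}

static int pr_write_trylock(struct percpu_rwlock *l)
{
	/* writer must see every per-CPU slot free - and, for PI, would
	 * have to be able to boost the owner of every occupied slot */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (l->reader[cpu])
			return 0;
	return 1;
}
```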
Re: Real-Time Preemption and RCU
On Tue, 22 Mar 2005, Ingo Molnar wrote: > * Esben Nielsen <[EMAIL PROTECTED]> wrote: > > On the other hand with a rw-lock being unlimited - and thus not keeping track of its readers - the readers can't be boosted by the writer. Then you are back to square 1: The grace period can take a very long time. > btw., is the 'very long grace period' a real issue? We could avoid all the RCU read-side locking latencies by making it truly unlocked and just living with the long grace periods. Perhaps it's enough to add an emergency mechanism to the OOM handler, which frees up all the 'blocked by preemption' RCU callbacks via some scheduler magic. (e.g. such an emergency mechanism could be _conditional_ locking on the read side - i.e. new RCU read-side users would be blocked until the OOM situation goes away, or something like that.) You won't catch RCU read-sides already entered - see below. > your patch is implementing just that, correct? Would you mind redoing it against a recent -RT base? (-40-04 or so) In fact I am working on a clean 2.6.12-rc1 right now. I figured this is orthogonal to the rest of the RT patch and can probably go upstream immediately. Seemed to work. I'll try to make it into a clean patch soonish and also try it on -40-04. I am trying to make a boosting mechanism in the scheduler such that RCU readers are boosted to MAX_RT_PRIO when preempted. I have to take it out first. Any specific tests I have to run? I am considering making an RCU test device. > also, what would be the worst-case workload causing long grace periods? A nice-19 task, A, enters an RCU region and is preempted. A lot of other tasks start running. Then task A might be starved for _minutes_ such that there are no RCU grace periods in all that time. > Ingo Esben
Re: Real-Time Preemption and RCU
On Tue, Mar 22, 2005 at 02:17:27AM -0800, Bill Huey wrote: > A notion of priority across a quiescence operation is crazy anyways, so > it would be safe just to use the old rwlock-semaphore "in place" without > any changes or priority handling additions. The RCU algorithm is only > concerned with what is basically a coarse data guard and it isn't time > or priority critical. A little jitter in a quiescence operation isn't > going to hurt things, right? The only thing that I can think of that can go wrong here is what kind of effect it would have on a thread write-blocking against a bunch of RCU readers. It could introduce a chain of delays into, say, a timer event and might cause problems/side-effects for other things being processed. RCU processing might have to be decoupled and processed by a different thread to avoid some of that latency weirdness. What do you folks think ? bill
Re: Real-Time Preemption and RCU
On Tue, 22 Mar 2005, Bill Huey wrote: > On Fri, Mar 18, 2005 at 05:55:44PM +0100, Esben Nielsen wrote: > > On Fri, 18 Mar 2005, Ingo Molnar wrote: > > > i really have no intention to allow multiple readers for rt-mutexes. We got away with that so far, and i'd like to keep it so. Imagine 100 threads all blocked in the same critical section (holding the read-lock) when a highprio writer thread comes around: instant 100x latency to let all of them roll forward. The only sane solution is to not allow excessive concurrency. (That limits SMP scalability, but there's no other choice i can see.) > > Unless a design change is made: One could argue for a semantics where write-locking _isn't_ deterministic and thus does not have to boost all the > RCU isn't write deterministic like typical RT apps are, so we can... (below :-)) It is: It takes place right away. But it is non-deterministic when _all_ readers actually read it. Also the cleanup is non-deterministic. So unless you actually _wait_ for the cleanup to happen instead of deferring it, you can safely do RCU writes in an RT-task. > > readers. Readers boost the writers but not the other way around. Readers will be deterministic, but not writers. Such a semantics would probably work for a lot of RT applications happening not to take any write-locks - these will in fact perform better. But it will give the rest a lot of problems. > Just came up with an idea after I thought about how much of a bitch it would be to get a fast RCU multiple reader semantic (our current shared-exclusive lock inserts owners into a sorted priority list per-thread which makes it very expensive for a simple RCU case since they are typically very small batches of items being altered). Basically the RCU algorithm has *no* notion of writer priority and to propagate a PI operation down all readers is meaningless, so why not revert back to the original rwlock-semaphore to get the multiple reader semantics ? Remember to boost the writer so that RT tasks can enter read regions.
I must also warn against the dangers: A lot of code where a write-lock is taken needs to be marked as non-deterministic, i.e. must not be called from RT-tasks (maybe put a WARN_ON(rt_task(current)) in the write-lock operation?) > A notion of priority across a quiescence operation is crazy anyways, so it would be safe just to use the old rwlock-semaphore "in place" without any changes or priority handling additions. The RCU algorithm is only concerned with what is basically a coarse data guard and it isn't time or priority critical. I don't find it crazy. I think it is elegant - but also dangerous as it might take a long time. > What do you folks think ? That would make Paul's stuff respect multiple readers which reduces contention and gets around the problem of possibly overloading the current rt lock implementation that we've been bitching about. The current RCU development track seems wrong in the first place and this seems like it could be a better and more complete solution to the problem. If this works, well, you heard it here first. :) > bill Esben
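Esben's WARN_ON(rt_task(current)) suggestion can be modeled in a few lines of userspace C (illustrative only - `rcu_write_lock_check`, `struct task` and `warn_count` are invented; `rt_task()`/`MAX_RT_PRIO` follow the kernel's convention that a priority below MAX_RT_PRIO means real-time):

```c
#include <stdio.h>

#define MAX_RT_PRIO 100

struct task { int prio; };

/* Kernel convention: prio < MAX_RT_PRIO means a real-time task. */
static int rt_task(const struct task *t)
{
	return t->prio < MAX_RT_PRIO;
}

static int warn_count;   /* stands in for the WARN_ON() taint/log */

static void rcu_write_lock_check(const struct task *current_task)
{
	/* the WARN_ON(rt_task(current)) Esben proposes: complain loudly
	 * if a deterministic (RT) task takes the non-deterministic
	 * write side */
	if (rt_task(current_task)) {
		fprintf(stderr,
			"WARNING: RT task taking non-deterministic write lock\n");
		warn_count++;
	}
	/* ... the actual write-side lock acquisition would go here ... */
}
```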
Re: Real-Time Preemption and RCU
* Esben Nielsen <[EMAIL PROTECTED]> wrote:

> +static inline void rcu_read_lock(void)
> +{
> +	preempt_disable();
> +	__get_cpu_var(rcu_data).active_readers++;
> +	preempt_enable();
> +}

this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on the same CPU, and hence ->active_readers can get out of sync.

	Ingo
Re: Real-Time Preemption and RCU
On Tue, 22 Mar 2005, Ingo Molnar wrote:
> * Esben Nielsen <[EMAIL PROTECTED]> wrote:
> > +static inline void rcu_read_lock(void)
> > +{
> > +	preempt_disable();
> > +	__get_cpu_var(rcu_data).active_readers++;
> > +	preempt_enable();
> > +}
>
> this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on the same CPU, and hence ->active_readers can get out of sync.

Ok, this has to be handled in the migration code somehow. I have already added a current->rcu_read_depth++ so it ought to be painless. A simple solution would be not to migrate threads with rcu_read_depth!=0.

Esben
Re: Real-Time Preemption and RCU
* Esben Nielsen [EMAIL PROTECTED] wrote: +static inline void rcu_read_lock(void) +{ + preempt_disable(); + __get_cpu_var(rcu_data).active_readers++; + preempt_enable(); +} this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on the same CPU, and hence -active_readers can get out of sync. Ok, this have to be handled in the mitigration code somehow. I have already added an current-rcu_read_depth++ so it ought to be painless. A simple solution would be not to mititagrate threads with rcu_read_depth!=0. could you elaborate? In any case, see the new PREEMPT_RCU code in the -40-07 patch (and upwards). I've also attached a separate patch, it should apply cleanly to 2.6.12-rc1. Ingo --- linux/kernel/rcupdate.c.orig +++ linux/kernel/rcupdate.c @@ -468,3 +468,36 @@ module_param(maxbatch, int, 0); EXPORT_SYMBOL(call_rcu); EXPORT_SYMBOL(call_rcu_bh); EXPORT_SYMBOL(synchronize_kernel); + +#ifdef CONFIG_PREEMPT_RCU + +void rcu_read_lock(void) +{ + if (current-rcu_read_lock_nesting++ == 0) { + current-rcu_data = get_cpu_var(rcu_data); + atomic_inc(current-rcu_data-active_readers); + put_cpu_var(rcu_data); + } +} +EXPORT_SYMBOL(rcu_read_lock); + +void rcu_read_unlock(void) +{ + int cpu; + + if (--current-rcu_read_lock_nesting == 0) { + atomic_dec(current-rcu_data-active_readers); + /* +* Check whether we have reached quiescent state. +* Note! This is only for the local CPU, not for +* current-rcu_data's CPU [which typically is the +* current CPU, but may also be another CPU]. 
+*/ + cpu = get_cpu(); + rcu_qsctr_inc(cpu); + put_cpu(); + } +} +EXPORT_SYMBOL(rcu_read_unlock); + +#endif --- linux/include/linux/rcupdate.h.orig +++ linux/include/linux/rcupdate.h @@ -99,6 +99,9 @@ struct rcu_data { struct rcu_head *donelist; struct rcu_head **donetail; int cpu; +#ifdef CONFIG_PREEMPT_RCU + atomic_tactive_readers; +#endif }; DECLARE_PER_CPU(struct rcu_data, rcu_data); @@ -115,12 +118,18 @@ extern struct rcu_ctrlblk rcu_bh_ctrlblk static inline void rcu_qsctr_inc(int cpu) { struct rcu_data *rdp = per_cpu(rcu_data, cpu); - rdp-passed_quiesc = 1; +#ifdef CONFIG_PREEMPT_RCU + if (!atomic_read(rdp-active_readers)) +#endif + rdp-passed_quiesc = 1; } static inline void rcu_bh_qsctr_inc(int cpu) { struct rcu_data *rdp = per_cpu(rcu_bh_data, cpu); - rdp-passed_quiesc = 1; +#ifdef CONFIG_PREEMPT_RCU + if (!atomic_read(rdp-active_readers)) +#endif + rdp-passed_quiesc = 1; } static inline int __rcu_pending(struct rcu_ctrlblk *rcp, @@ -183,14 +192,22 @@ static inline int rcu_pending(int cpu) * * It is illegal to block while in an RCU read-side critical section. */ -#define rcu_read_lock()preempt_disable() +#ifdef CONFIG_PREEMPT_RCU + extern void rcu_read_lock(void); +#else +# define rcu_read_lock preempt_disable() +#endif /** * rcu_read_unlock - marks the end of an RCU read-side critical section. * * See rcu_read_lock() for more information. */ -#define rcu_read_unlock() preempt_enable() +#ifdef CONFIG_PREEMPT_RCU + extern void rcu_read_unlock(void); +#else +# define rcu_read_unlock preempt_enable() +#endif /* * So where is rcu_write_lock()? It does not exist, as there is no @@ -213,14 +230,16 @@ static inline int rcu_pending(int cpu) * can use just rcu_read_lock(). * */ -#define rcu_read_lock_bh() local_bh_disable() +//#define rcu_read_lock_bh() local_bh_disable() +#define rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); } /* * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section * * See rcu_read_lock_bh() for more information. 
 */
-#define rcu_read_unlock_bh()	local_bh_enable()
+//#define rcu_read_unlock_bh()	local_bh_enable()
+#define rcu_read_unlock_bh()	{ local_bh_enable(); rcu_read_unlock(); }

 /**
 * rcu_dereference - fetch an RCU-protected pointer in an
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -727,6 +872,10 @@ struct task_struct {
 	nodemask_t mems_allowed;
 	int cpuset_mems_generation;
 #endif
+#ifdef CONFIG_PREEMPT_RCU
+	int rcu_read_lock_nesting;
+	struct rcu_data *rcu_data;
+#endif
 };

 static inline pid_t process_group(struct task_struct *tsk)
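The patch above gates the per-CPU quiescent-state report on a count of outstanding readers. The following is a minimal user-space model of that gating logic, illustrative only: the model_* names are made up, a plain int stands in for the patch's atomic_t, and only a single CPU's rcu_data is modeled.

```c
#include <assert.h>

/* One CPU's RCU bookkeeping, loosely modeling struct rcu_data above. */
struct rcu_data_model {
	int active_readers;	/* models the patch's atomic_t active_readers */
	int passed_quiesc;	/* models rdp->passed_quiesc */
};

static struct rcu_data_model rdp;	/* a single CPU, for illustration */

static void model_rcu_read_lock(void)   { rdp.active_readers++; }
static void model_rcu_read_unlock(void) { rdp.active_readers--; }

/* Mirrors the patched rcu_qsctr_inc(): the quiescent state is only
 * reported while no (possibly preempted) reader is inside a
 * read-side critical section on this CPU. */
static void model_rcu_qsctr_inc(void)
{
	if (rdp.active_readers == 0)
		rdp.passed_quiesc = 1;
}
```

The point of the model is the suppression: while a reader is active, what would otherwise be a quiescent point is ignored, so the grace period cannot complete under a preempted reader.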
Re: Real-Time Preemption and RCU
On Tue, 22 Mar 2005, Ingo Molnar wrote: * Esben Nielsen [EMAIL PROTECTED] wrote:

+static inline void rcu_read_lock(void)
+{
+	preempt_disable();
+	__get_cpu_var(rcu_data).active_readers++;
+	preempt_enable();
+}

this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on the same CPU, and hence ->active_readers can get out of sync.

Ok, this has to be handled in the migration code somehow. I have already added a current->rcu_read_depth++ so it ought to be painless. A simple solution would be not to migrate threads with rcu_read_depth != 0.

could you elaborate?

Put an rcu_read_depth on each task. In rcu_read_lock() do a current->rcu_read_depth++; and vice versa in rcu_read_unlock(). In can_migrate_task() add

	if (p->rcu_read_depth)
		return 0;

That might do the trick.

In any case, see the new PREEMPT_RCU code in the -40-07 patch (and upwards). I've also attached a separate patch, it should apply cleanly to 2.6.12-rc1.

I barely have time to download the patches - let alone apply them! Anyway: I find one thing I don't like: using atomic_inc()/dec() in rcu_read_lock()/unlock() to touch rcu_data which might be on another CPU. Then rcu_data is not really per-CPU data anymore and it also hurts the performance of RCU readers. I think it will be cheaper to use the above rcu_read_depth and then either not migrate tasks at all or make the migration code take care of migrating the rcu_read_depth count to the new CPU - one would have to take care to increment it in the rcu_data of the new CPU on the new CPU (it isn't atomic) and then decrement it in the rcu_data of the old CPU on the old CPU - in that order.

Ingo

Esben
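The per-task rcu_read_depth idea discussed above can be sketched in a few lines of user-space C. This is a model of the proposal, not the actual patch; all names here are hypothetical, and the real can_migrate_task() takes a task_struct plus scheduler arguments.

```c
#include <assert.h>

/* Per-task state; only the proposed nesting counter is modeled here. */
struct task_model {
	int rcu_read_depth;	/* read-side nesting count, as proposed above */
};

static void task_rcu_read_lock(struct task_model *t)   { t->rcu_read_depth++; }
static void task_rcu_read_unlock(struct task_model *t) { t->rcu_read_depth--; }

/* Models the proposed check in can_migrate_task(): a task inside a
 * read-side critical section must stay on its CPU, so the per-CPU
 * reader count can never be decremented on a different CPU than it
 * was incremented on. Returns 0 for "may not migrate". */
static int model_can_migrate_task(const struct task_model *t)
{
	return t->rcu_read_depth == 0;
}
```

Pinning readers to their CPU is what makes the cheap non-atomic per-CPU counter safe, at the cost of constraining the load balancer while a read-side section is open.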
Re: Real-Time Preemption and RCU
On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote: On Mon, 21 Mar 2005, Paul E. McKenney wrote: [ . . . ] On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote: This is in some ways similar to the K42 approach to RCU (which they call generations). Dipankar put together a similar patch for Linux, but the problem was that grace periods could be deferred for an extremely long time. Which I suspect is what you were calling out as causing RCU batches never to run. That is where the preempt_by_nonrt_disable/enable() is supposed to help: Then it can't take longer than the normal kernel in the situation where there are no RT tasks running. RT tasks will prolong the grace periods if they go into RCU regions, but they are supposed to be relatively small - and deterministic! The part that I am missing is how this helps in the case where a non-RT task gets preempted in the middle of an RCU read-side critical section indefinitely. Or are you boosting the priority of any task that enters an RCU read-side critical section? The counter approach might work, and is also what implementation #5 does -- check out rcu_read_lock() in Ingo's most recent patch. Do you refer to your original mail with implementing it in 5 steps? In #5 in that one (-V0.7.41-00, right?) you use a lock and, as you say, that forces synchronization between the CPUs - bad for scaling. It does make the RCU batches somewhat deterministic, as the RCU task can boost the readers to the rcu-task's priority. The problem with this approach is that everybody calling into RCU code has a worst-case behaviour of the system-wide worst-case RCU reader section - which can be pretty large (in principle infinite if somebody.) So if somebody uses a call to a function in the code containing an RCU read area, the worst-case behaviour would be the same as the worst-case latency in the simple world where preempt_disable()/preempt_enable() was used.
I missed something here -- readers would see the worst-case -writer- latency rather than the worst-case -reader- latency, right? Or are you concerned about the case where some reader blocks the write-side acquisitions in _synchronize_kernel(), but not before the writer has grabbed a lock, blocking any readers on the corresponding CPU? I am concerned that readers block each other, too. You do need an rw-mutex allowing an unlimited number of readers for doing this. With the current rw-mutex the readers block each other. I.e. the worst-case latency is the worst-case reader latency - globally! On the other hand, with an rw-lock being unlimited - and thus not keeping track of its readers - the readers can't be boosted by the writer. Then you are back to square 1: The grace period can take a very long time. OK, good point. Yes, but this is true of every other lock in the system as well, not? Other locks are not globally used but only used for a specific subsystem. On a real-time system you are supposed to know which subsystems you can call into and still have a low enough latency, as each subsystem has its own bound. But with a global RCU locking mechanism all RCU-using code is to be regarded as _one_ such subsystem. Yep. As would the things protected by the dcache lock, task list lock, and so on, right? Thanx, Paul In a separate conversation a few weeks ago, Steve Hemminger suggested placing checks in long RCU read-side critical sections -- this could be used to keep the worst-case reader latency within the desired bounds. Not pretty, but, then again, bounding lock latency will require a fair number of similar changes, I suspect. Thanx, Paul Esben
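The suggestion of placing checks inside long read-side critical sections might look roughly like the following user-space sketch. Everything here is hypothetical (the rcu_read_checkpoint() name and the counters are invented for illustration); the point is only that the outermost reader briefly leaves the read-side section, opening a quiescent opportunity so the grace-period machinery can advance and the worst-case reader latency stays bounded.

```c
#include <assert.h>

static int active_readers;	/* models a per-CPU count of readers */
static int nesting;		/* models the current task's nesting depth */

static void reader_enter(void) { active_readers++; nesting++; }
static void reader_exit(void)  { active_readers--; nesting--; }

/* Hypothetical rcu_read_checkpoint(): only meaningful at the outermost
 * nesting level, and only legal if the caller holds no pointers into
 * RCU-protected data across the call, since a grace period may complete
 * while the section is momentarily closed. */
static void rcu_read_checkpoint(void)
{
	if (nesting == 1) {
		reader_exit();	/* quiescent opportunity opens here */
		reader_enter();
	}
}
```

A long reader would call this periodically inside its loop; nested readers are unaffected because the check only fires when leaving the section is actually safe.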
Re: Real-Time Preemption and RCU
On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote: > On Sun, 20 Mar 2005, Paul E. McKenney wrote: > > > On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote: > > > On Fri, 18 Mar 2005, Ingo Molnar wrote: > > > > > > > [...] > > > > > > I think it can be deterministic (on the long timescale of memory > > > management) > > > anyway: Boost any non-RT task entering an RCU region to the lowest RT > > > priority. > > > This way only all the RT tasks + one non-RT task can be within those > > > regions. The RT-tasks are supposed to have some kind of upper bound to > > > their CPU-usage. The non-RT task will also finish "soon" as it is > > > boosted. If the RCU batches are also at the lowest RT-priority they can be > > > run immediately after the non-RT task is done. > > > > Hmmm... Sort of a preemptive-first-strike priority boost. Cute! ;-) > > > Well, I was actually thinking of an API like > preempt_by_nonrt_disable() > preempt_by_nonrt_enable() > working like the old preempt_disable()/preempt_enable() but still > allowing RT tasks (as well as priority inheriting non-RT tasks) to be > scheduled. That would kind of help "split" the kernel into two halfs: the > RT part and the non-RT part. The non-RT part would in many ways work as it > has always done. Does sound in some ways similar to the migration approach -- there, the RT/non-RT split is made across CPUs. But if RT is allowed to preempt, then you still have to deal with preemption for locking correctness, right? > > > > clearly the simplest solution is to go with the single-reader locks for > > > > now - a separate experiment could be done with a new type of rwlock that > > > > can only be used by the RCU code. (I'm not quite sure whether we could > > > > guarantee a minimum rate of RCU callback processing under such a scheme > > > > though. It's an eventual memory DoS otherwise.) > > > > > > > > > > Why are a lock needed at all? 
If it is doable without locking for an > > > non-preemptable SMP kernel it must be doable for an preemptable kernel as > > > well.I am convinced some kind of per-CPU rcu_read_count as I specified in > > > my previous mail can work some way or the other. call_rcu() might need to > > > do more complicated stuff and thus use CPU but call_rcu() is supposed to > > > be an relative rare event not worth optimizing for. Such an > > > implementation will work for any preemptable kernel, not only PREEMPT_RT. > > > For performance is considered it is important not to acquire any locks in > > > the rcu-read regions. > > > > You definitely don't need a lock -- you can just suppress preemption > > on the read side instead. But that potentially makes for long scheduling > > latencies. > > Well, in my patch I do not leave preemption off - only while doing the > simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader > might be active, i.e. preempted, on the current CPU such that this isn't > and quiescent point after all. > (To others: Paul nicely unfolded my attachment below - I left it in > the mail such you can read it.) > The problem with this approach is ofcourse that user space programs might > preempt an RCU reader for a very long time such that RCU batches are never > really run. The boosting of non-RT tasks mentioned above would help a > lot. > A plus(?) in it: You can actually sleep while having the rcu_read_lock !! This is in some ways similar to the K42 approach to RCU (which they call "generations"). Dipankar put together a similar patch for Linux, but the problem was that grace periods could be deferred for an extremely long time. Which I suspect is what you were calling out as causing RCU batches never to run. > > The counter approach might work, and is also what the implementation #5 > > does -- check out rcu_read_lock() in Ingo's most recent patch. > > > > Do you refer to your original mail with implementing it in 5 steps? 
> In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that > forces syncronization between the CPUs - bad for scaling. It does make the > RCU batches somewhat deterministic, as the RCU task can boost the readers > to the rcu-task's priority. > The problem about this approach is that everybody calling into RCU code > have a worst case behaviour of the systemwide worst case RCU reader > section - which can be pretty large (in principle infinite if somebody.) > So if somebody uses a call to a function in the code containing a RCU read > area the worst case behavious would be the same as teh worst case latency > in the simple world where preempt_disable()/preempt_enable() was used. I missed something here -- readers would see the worst-case -writer- latency rather than the worst-case -reader- latency, right? Or are you concerned about the case where some reader blocks the write-side acquisitions in _synchronize_kernel(), but not before the writer has grabbed a lock, blocking any readers on the corresponding CPU? Yes, but this is
Re: Real-Time Preemption and RCU
On Sun, 20 Mar 2005, Paul E. McKenney wrote: > On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote: > > On Fri, 18 Mar 2005, Ingo Molnar wrote: > > > > > [...] > > > > I think it can be deterministic (on the long timescale of memory > > management) > > anyway: Boost any non-RT task entering an RCU region to the lowest RT > > priority. > > This way only all the RT tasks + one non-RT task can be within those > > regions. The RT-tasks are supposed to have some kind of upper bound to > > their CPU-usage. The non-RT task will also finish "soon" as it is > > boosted. If the RCU batches are also at the lowest RT-priority they can be > > run immediately after the non-RT task is done. > > Hmmm... Sort of a preemptive-first-strike priority boost. Cute! ;-) > Well, I was actually thinking of an API like preempt_by_nonrt_disable() preempt_by_nonrt_enable() working like the old preempt_disable()/preempt_enable() but still allowing RT tasks (as well as priority inheriting non-RT tasks) to be scheduled. That would kind of help "split" the kernel into two halfs: the RT part and the non-RT part. The non-RT part would in many ways work as it has always done. > > > clearly the simplest solution is to go with the single-reader locks for > > > now - a separate experiment could be done with a new type of rwlock that > > > can only be used by the RCU code. (I'm not quite sure whether we could > > > guarantee a minimum rate of RCU callback processing under such a scheme > > > though. It's an eventual memory DoS otherwise.) > > > > > > > Why are a lock needed at all? If it is doable without locking for an > > non-preemptable SMP kernel it must be doable for an preemptable kernel as > > well.I am convinced some kind of per-CPU rcu_read_count as I specified in > > my previous mail can work some way or the other. call_rcu() might need to > > do more complicated stuff and thus use CPU but call_rcu() is supposed to > > be an relative rare event not worth optimizing for. 
Such an > > implementation will work for any preemptable kernel, not only PREEMPT_RT. > > For performance is considered it is important not to acquire any locks in > > the rcu-read regions. > > You definitely don't need a lock -- you can just suppress preemption > on the read side instead. But that potentially makes for long scheduling > latencies. Well, in my patch I do not leave preemption off - only while doing the simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader might be active, i.e. preempted, on the current CPU such that this isn't and quiescent point after all. (To others: Paul nicely unfolded my attachment below - I left it in the mail such you can read it.) The problem with this approach is ofcourse that user space programs might preempt an RCU reader for a very long time such that RCU batches are never really run. The boosting of non-RT tasks mentioned above would help a lot. A plus(?) in it: You can actually sleep while having the rcu_read_lock !! > > The counter approach might work, and is also what the implementation #5 > does -- check out rcu_read_lock() in Ingo's most recent patch. > Do you refer to your original mail with implementing it in 5 steps? In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that forces syncronization between the CPUs - bad for scaling. It does make the RCU batches somewhat deterministic, as the RCU task can boost the readers to the rcu-task's priority. The problem about this approach is that everybody calling into RCU code have a worst case behaviour of the systemwide worst case RCU reader section - which can be pretty large (in principle infinite if somebody.) So if somebody uses a call to a function in the code containing a RCU read area the worst case behavious would be the same as teh worst case latency in the simple world where preempt_disable()/preempt_enable() was used. > Thanx, Paul > > > I tried this approach. My UP labtop did boot on it, but I haven't testet > > it further. 
I have included the very small patch as an attachment. > > > > > Ingo > > > > I have not yet looked at -V0.7.41-00... > > > > Esben > > > > > diff -Naur --exclude-from diff_exclude > > linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h > > linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h > > --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h 2005-03-11 > > 23:40:13.0 +0100 > > +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h > > 2005-03-19 12:47:09.0 +0100 > > @@ -85,6 +85,7 @@ > > * curlist - current batch for which quiescent cycle started if any > > */ > > struct rcu_data { > > + long active_readers; > > /* 1) quiescent state handling : */ > > long quiescbatch; /* Batch # for grace period */ > > int passed_quiesc; /* User-mode/idle loop etc. */ > > @@ -115,12 +116,14 @@ > > static inline void
Re: Real-Time Preemption and RCU
On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote: > On Fri, 18 Mar 2005, Ingo Molnar wrote: > > > > > * Esben Nielsen <[EMAIL PROTECTED]> wrote: > > > > > Why can should there only be one RCU-reader per CPU at each given > > > instance? Even on a real-time UP system it would be very helpfull to > > > have RCU areas to be enterable by several tasks as once. It would > > > perform better, both wrt. latencies and throughput: With the above > > > implementation an high priority task entering an RCU area will have to > > > boost the current RCU reader, make a task switch until that one > > > finishes and makes yet another task switch. to get back to the high > > > priority task. With an RCU implementation which can take n RCU readers > > > per CPU there is no such problem. > > > > correct, for RCU we could allow multiple readers per lock, because the > > 'blocking' side of RCU (callback processing) is never (supposed to be) > > in any latency path. > > > > except if someone wants to make RCU callback processing deterministic at > > some point. (e.g. for memory management reasons.) > > I think it can be deterministic (on the long timescale of memory management) > anyway: Boost any non-RT task entering an RCU region to the lowest RT > priority. > This way only all the RT tasks + one non-RT task can be within those > regions. The RT-tasks are supposed to have some kind of upper bound to > their CPU-usage. The non-RT task will also finish "soon" as it is > boosted. If the RCU batches are also at the lowest RT-priority they can be > run immediately after the non-RT task is done. Hmmm... Sort of a preemptive-first-strike priority boost. Cute! ;-) > > clearly the simplest solution is to go with the single-reader locks for > > now - a separate experiment could be done with a new type of rwlock that > > can only be used by the RCU code. (I'm not quite sure whether we could > > guarantee a minimum rate of RCU callback processing under such a scheme > > though. 
It's an eventual memory DoS otherwise.) > > > > Why are a lock needed at all? If it is doable without locking for an > non-preemptable SMP kernel it must be doable for an preemptable kernel as > well.I am convinced some kind of per-CPU rcu_read_count as I specified in > my previous mail can work some way or the other. call_rcu() might need to > do more complicated stuff and thus use CPU but call_rcu() is supposed to > be an relative rare event not worth optimizing for. Such an > implementation will work for any preemptable kernel, not only PREEMPT_RT. > For performance is considered it is important not to acquire any locks in > the rcu-read regions. You definitely don't need a lock -- you can just suppress preemption on the read side instead. But that potentially makes for long scheduling latencies. The counter approach might work, and is also what the implementation #5 does -- check out rcu_read_lock() in Ingo's most recent patch. Thanx, Paul > I tried this approach. My UP labtop did boot on it, but I haven't testet > it further. I have included the very small patch as an attachment. > > > Ingo > > I have not yet looked at -V0.7.41-00... > > Esben > > diff -Naur --exclude-from diff_exclude > linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h > linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h > --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h2005-03-11 > 23:40:13.0 +0100 > +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h > 2005-03-19 12:47:09.0 +0100 > @@ -85,6 +85,7 @@ > * curlist - current batch for which quiescent cycle started if any > */ > struct rcu_data { > + long active_readers; > /* 1) quiescent state handling : */ > long quiescbatch; /* Batch # for grace period */ > int passed_quiesc; /* User-mode/idle loop etc.
*/ > @@ -115,12 +116,14 @@ > static inline void rcu_qsctr_inc(int cpu) > { > struct rcu_data *rdp = &per_cpu(rcu_data, cpu); > - rdp->passed_quiesc = 1; > + if(rdp->active_readers==0) > + rdp->passed_quiesc = 1; > } > static inline void rcu_bh_qsctr_inc(int cpu) > { > struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu); > - rdp->passed_quiesc = 1; > + if(rdp->active_readers==0) > + rdp->passed_quiesc = 1; > } > > static inline int __rcu_pending(struct rcu_ctrlblk *rcp, > @@ -183,29 +186,27 @@ > * > * It is illegal to block while in an RCU read-side critical section. > */ > -#define rcu_read_lock() preempt_disable() > +static inline void rcu_read_lock(void) > +{ > + preempt_disable(); > + __get_cpu_var(rcu_data).active_readers++; > + preempt_enable(); > +} > > /** > * rcu_read_unlock - marks the end of an RCU read-side critical section. > * > * See rcu_read_lock() for more information. > */ > -#define rcu_read_unlock()
Re: Real-Time Preemption and RCU
On Sun, Mar 20, 2005 at 01:38:24PM -0800, Bill Huey wrote:
> On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote:
> > That was just one random example.
> > Another one would be :
> >
> > drivers/char/tty_io.c, __do_SAK() contains
> >	read_lock(&tasklist_lock);
> >	task_lock(p);
> >
> > kernel/sys.c, sys_setrlimit contains
> >	task_lock(current->group_leader);
> >	read_lock(&tasklist_lock);
> >
> > task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is
> > a normal spinlock, then this is an A/B B/A deadlock.
>
> That code was already dubious in the first place just because it
> contained that circularity. If you had a rwlock that blocks on an
> upper read count maximum, a deadlock situation would trigger anyways,
> say, upon a flood of threads trying to do that sequence of acquires.

The RT patch uses the lock ordering "in place", and whatever nasty situation was going on previously will effectively be under high load, which increases the chance of it being triggered. Removal of the read-side semantic just increases load more so that those cases can trigger. I disagree with this approach and I have an alternate implementation here that restores it. It's only half tested and fairly meaningless until an extreme contention case is revealed with the current rt lock implementation. Numbers need to be gathered to prove or disprove this conjecture.

> I'd probably experiment with using the {spin,read,write}-trylock
> logic and release all the locks taken in a sequence like that
> on the failure to acquire any of the locks in the chain as an
> initial fix. A longer term fix might be to break things up a bit
> so that whatever ordering being done would have that circularity.

Excuse me, ...would *not* have that circularity.

bill
Re: Real-Time Preemption and RCU
On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote:
> That was just one random example.
> Another one would be :
>
> drivers/char/tty_io.c, __do_SAK() contains
>	read_lock(&tasklist_lock);
>	task_lock(p);
>
> kernel/sys.c, sys_setrlimit contains
>	task_lock(current->group_leader);
>	read_lock(&tasklist_lock);
>
> task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is
> a normal spinlock, then this is an A/B B/A deadlock.

That code was already dubious in the first place just because it contained that circularity. If you had a rwlock that blocks on an upper read count maximum, a deadlock situation would trigger anyways, say, upon a flood of threads trying to do that sequence of acquires.

I'd probably experiment with using the {spin,read,write}-trylock logic and release all the locks taken in a sequence like that on the failure to acquire any of the locks in the chain as an initial fix. A longer term fix might be to break things up a bit so that whatever ordering being done would have that circularity.

BTW, the runtime lock circularity detector was designed to trigger on that situation anyways. That's my thoughts on the matter.

bill
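The trylock-and-back-off idea described above can be sketched in user space with C11 atomic flags. This is an illustration of the acquisition pattern only, not the proposed rt-lock implementation; the lock names are borrowed from Manfred's tasklist_lock / alloc_lock example, and real kernel code would use the spin/read/write trylock primitives instead.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_flag lock_a = ATOMIC_FLAG_INIT;	/* stands in for tasklist_lock */
static atomic_flag lock_b = ATOMIC_FLAG_INIT;	/* stands in for p->alloc_lock */

/* Returns 1 if the lock was acquired, 0 if it was already held. */
static int try_lock(atomic_flag *l) { return !atomic_flag_test_and_set(l); }
static void unlock(atomic_flag *l)  { atomic_flag_clear(l); }

/* Acquire both locks, first-argument first. If the second lock cannot
 * be taken immediately, drop everything and retry, so a thread never
 * holds one lock while spinning forever on the other - which is what
 * breaks the A/B B/A deadlock. */
static void lock_both(atomic_flag *first, atomic_flag *second)
{
	for (;;) {
		while (!try_lock(first))
			;			/* spin for the first lock */
		if (try_lock(second))
			return;			/* got both: done */
		unlock(first);			/* back off and retry */
	}
}
```

Because a failed second acquisition releases the first lock, two threads taking the locks in opposite orders can only delay each other, never deadlock; the cost is that under heavy contention the retry loop can livelock briefly, which is why this is described above as an initial fix rather than the long-term one.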
Re: Real-Time Preemption and RCU
Thomas Gleixner wrote: On Sun, 2005-03-20 at 07:36 +0100, Manfred Spraul wrote:

cpu 1: acquire random networking spin_lock_bh()
cpu 2: read_lock(&tasklist_lock) from process context
interrupt. softirq. within softirq: try to acquire the networking lock. * spins.
cpu 1: hardware interrupt
within hw interrupt: signal delivery. tries to acquire tasklist_lock.
--> deadlock.

Signal delivery from hw interrupt context (interrupt is flagged SA_NODELAY) is not possible in RT preemption mode. The local_irq_save_nort() check in __cache_alloc will catch you.

That was just one random example. Another one would be :

drivers/char/tty_io.c, __do_SAK() contains
	read_lock(&tasklist_lock);
	task_lock(p);

kernel/sys.c, sys_setrlimit contains
	task_lock(current->group_leader);
	read_lock(&tasklist_lock);

task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is a normal spinlock, then this is an A/B B/A deadlock.

--
Manfred
Re: Real-Time Preemption and RCU
On Fri, 18 Mar 2005, Ingo Molnar wrote: > > * Esben Nielsen <[EMAIL PROTECTED]> wrote: > > > Why can should there only be one RCU-reader per CPU at each given > > instance? Even on a real-time UP system it would be very helpfull to > > have RCU areas to be enterable by several tasks as once. It would > > perform better, both wrt. latencies and throughput: With the above > > implementation an high priority task entering an RCU area will have to > > boost the current RCU reader, make a task switch until that one > > finishes and makes yet another task switch. to get back to the high > > priority task. With an RCU implementation which can take n RCU readers > > per CPU there is no such problem. > > correct, for RCU we could allow multiple readers per lock, because the > 'blocking' side of RCU (callback processing) is never (supposed to be) > in any latency path. > > except if someone wants to make RCU callback processing deterministic at > some point. (e.g. for memory management reasons.) I think it can be deterministic (on the long timescale of memory management) anyway: Boost any non-RT task entering an RCU region to the lowest RT priority. This way only all the RT tasks + one non-RT task can be within those regions. The RT-tasks are supposed to have some kind of upper bound to their CPU-usage. The non-RT task will also finish "soon" as it is boosted. If the RCU batches are also at the lowest RT-priority they can be run immediately after the non-RT task is done. > > clearly the simplest solution is to go with the single-reader locks for > now - a separate experiment could be done with a new type of rwlock that > can only be used by the RCU code. (I'm not quite sure whether we could > guarantee a minimum rate of RCU callback processing under such a scheme > though. It's an eventual memory DoS otherwise.) > Why are a lock needed at all? 
If it is doable without locking for a non-preemptable SMP kernel, it must be doable for a preemptable kernel as well. I am convinced some kind of per-CPU rcu_read_count as I specified in my previous mail can work some way or the other. call_rcu() might need to do more complicated stuff and thus use CPU, but call_rcu() is supposed to be a relatively rare event not worth optimizing for. Such an implementation will work for any preemptable kernel, not only PREEMPT_RT. Where performance is concerned, it is important not to acquire any locks in the rcu-read regions. I tried this approach. My UP laptop did boot on it, but I haven't tested it further. I have included the very small patch as an attachment. > Ingo I have not yet looked at -V0.7.41-00... Esben diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h 2005-03-11 23:40:13.0 +0100 +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h 2005-03-19 12:47:09.0 +0100 @@ -85,6 +85,7 @@ * curlist - current batch for which quiescent cycle started if any */ struct rcu_data { + long active_readers; /* 1) quiescent state handling : */ long quiescbatch; /* Batch # for grace period */ int passed_quiesc; /* User-mode/idle loop etc. */ @@ -115,12 +116,14 @@ static inline void rcu_qsctr_inc(int cpu) { struct rcu_data *rdp = &per_cpu(rcu_data, cpu); - rdp->passed_quiesc = 1; + if(rdp->active_readers==0) + rdp->passed_quiesc = 1; } static inline void rcu_bh_qsctr_inc(int cpu) { struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu); - rdp->passed_quiesc = 1; + if(rdp->active_readers==0) + rdp->passed_quiesc = 1; } static inline int __rcu_pending(struct rcu_ctrlblk *rcp, @@ -183,29 +186,27 @@ * * It is illegal to block while in an RCU read-side critical section. 
*/ -#define rcu_read_lock() preempt_disable() +static inline void rcu_read_lock(void) +{ + preempt_disable(); + __get_cpu_var(rcu_data).active_readers++; + preempt_enable(); +} /** * rcu_read_unlock - marks the end of an RCU read-side critical section. * * See rcu_read_lock() for more information. */ -#define rcu_read_unlock() preempt_enable() +static inline void rcu_read_unlock(void) +{ + preempt_disable(); + __get_cpu_var(rcu_data).active_readers--; + preempt_enable(); +} #define IGNORE_LOCK(op, lock) do { (void)(lock); op(); } while (0) -#ifdef CONFIG_PREEMPT_RT -# define rcu_read_lock_spin(lock) spin_lock(lock) -# define rcu_read_unlock_spin(lock) spin_unlock(lock) -# define rcu_read_lock_read(lock) read_lock(lock) -# define rcu_read_unlock_read(lock) read_unlock(lock) -# define rcu_read_lock_bh_read(lock) read_lock_bh(lock) -# define rcu_read_unlock_bh_read(lock) read_unlock_bh(lock) -# define rcu_read_lock_down_read(rwsem) down_read(rwsem)
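The core idea of the patch above — a context switch only counts as a quiescent state if no reader section is outstanding on the CPU — can be modeled in plain C. This is a hypothetical single-CPU user-space model of the patched `rcu_qsctr_inc()` path, not kernel code:

```c
#include <assert.h>

/* model of the patch's per-CPU state */
static long active_readers;
static int passed_quiesc;

/* rcu_read_lock()/rcu_read_unlock() reduce to counter updates;
 * preempt_disable()/preempt_enable() around the ++/-- are elided
 * since this model is single-threaded */
static void rcu_read_lock_model(void)
{
	active_readers++;
}

static void rcu_read_unlock_model(void)
{
	active_readers--;
}

/* rcu_qsctr_inc(): a context switch is a quiescent state only
 * if no (possibly preempted) reader is outstanding on this cpu */
static void rcu_qsctr_inc_model(void)
{
	if (active_readers == 0)
		passed_quiesc = 1;
}
```

With a reader outstanding (e.g. preempted inside the read section), quiescent points are suppressed; as soon as the last reader exits, the next one is reported and the grace period machinery can advance.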
Re: Real-Time Preemption and RCU
On Sun, 2005-03-20 at 07:36 +0100, Manfred Spraul wrote: > cpu 1: > acquire random networking spin_lock_bh() > > cpu 2: > read_lock(&tasklist_lock) from process context > interrupt. softirq. within softirq: try to acquire the networking lock. > * spins. > > cpu 1: > hardware interrupt > within hw interrupt: signal delivery. tries to acquire tasklist_lock. > > --> deadlock. Signal delivery from hw interrupt context (interrupt is flagged SA_NODELAY) is not possible in RT preemption mode. The local_irq_save_nort() check in __cache_alloc will catch you. When it happens from a threaded irq handler the situation is solvable by the PI code. tglx
Re: Real-Time Preemption and RCU
On Mar 19, 2005, at 11:31, Ingo Molnar wrote: > > What about allowing only as many concurrent readers as there are CPUs? > since a reader may be preempted by a higher prio task, there is no > linear relationship between CPU utilization and the number of readers > allowed. You could easily end up having all the nr_cpus readers > preempted on one CPU. It gets pretty messy. One solution I can think of, although it bloats memory usage for many-way boxen, is to just have a table in the rwlock with one entry per cpu. Each CPU would get one concurrent reader; others would need to sleep. Cheers, Kyle Moffett -BEGIN GEEK CODE BLOCK- Version: 3.12 GCM/CS/IT/U d- s++: a18 C>$ UB/L/X/*(+)>$ P+++()>$ L(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+ PGP+++ t+(+++) 5 X R? tv-(--) b(++) DI+ D+ G e->$ h!*()>++$ r !y?(-) --END GEEK CODE BLOCK--
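Kyle's per-CPU reader-slot table could look roughly like this. A hypothetical sketch: the slot array and trylock/unlock names are invented for illustration, and the "sleep" case is reduced to a failure return:

```c
#include <assert.h>

#define NR_CPUS 4

/* one reader slot per cpu: reader_slot[cpu] != 0 means taken */
struct slot_rwlock {
	int reader_slot[NR_CPUS];
};

/* try to become this cpu's one concurrent reader; returns 0 on
 * success, -1 if the slot is taken and the caller would have to
 * sleep (or be queued) until the current reader drops it */
static int slot_read_trylock(struct slot_rwlock *l, int cpu)
{
	if (l->reader_slot[cpu])
		return -1;
	l->reader_slot[cpu] = 1;
	return 0;
}

static void slot_read_unlock(struct slot_rwlock *l, int cpu)
{
	l->reader_slot[cpu] = 0;
}
```

The table bounds concurrency at nr_cpus readers without a shared counter, at the memory cost Kyle notes; it does not by itself address Ingo's point that all the slot holders could end up preempted on one CPU.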
Re: Real-Time Preemption and RCU
On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote: > That was just one random example. Another one would be: > drivers/char/tty_io.c, __do_SAK() contains read_lock(&tasklist_lock); task_lock(p); > kernel/sys.c, sys_setrlimit contains task_lock(current->group_leader); read_lock(&tasklist_lock); > task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is a normal spinlock, then this is an A/B B/A deadlock. That code was already dubious in the first place just because it contained that circularity. If you had a rwlock that blocks on an upper read count maximum, a deadlock situation would trigger anyway, say, upon a flood of threads trying to do that sequence of acquires. I'd probably experiment with using the {spin,read,write}-trylock logic and release all the locks taken in such a sequence on the failure to acquire any of the locks in the chain, as an initial fix. A longer term fix might be to break things up a bit so that whatever ordering being done would have that circularity. BTW, the runtime lock circularity detector was designed to trigger on that situation anyway. That's my thoughts on the matter. bill
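Bill's initial fix — trylock each lock in the chain and, on any failure, drop everything already held so the caller can back off and retry — can be sketched with toy locks. A minimal model: the `int`-based trylock stands in for `{spin,read,write}_trylock`, and the retry/backoff policy is left to the caller:

```c
#include <assert.h>

/* a toy lock: 0 = free, 1 = held (stand-in for spin_trylock & co.) */
static int try_lock(int *l)
{
	if (*l)
		return -1;
	*l = 1;
	return 0;
}

static void unlock_one(int *l)
{
	*l = 0;
}

/* take 'n' locks in order; on any failure drop everything already
 * held and report failure so the caller can back off and retry,
 * instead of spinning into an A/B B/A deadlock */
static int acquire_chain(int **locks, int n)
{
	int i, j;

	for (i = 0; i < n; i++) {
		if (try_lock(locks[i]) != 0) {
			for (j = i - 1; j >= 0; j--)
				unlock_one(locks[j]);
			return -1;
		}
	}
	return 0;
}

static void release_chain(int **locks, int n)
{
	int i;

	for (i = n - 1; i >= 0; i--)
		unlock_one(locks[i]);
}
```

Two contexts taking the same pair in opposite orders then fail cleanly instead of deadlocking; the cost is livelock risk under contention, which is why a real fix would rather remove the circular ordering itself.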
Re: Real-Time Preemption and RCU
On Sun, Mar 20, 2005 at 01:38:24PM -0800, Bill Huey wrote: > On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote: > > That was just one random example. Another one would be: drivers/char/tty_io.c, __do_SAK() contains read_lock(&tasklist_lock); task_lock(p); kernel/sys.c, sys_setrlimit contains task_lock(current->group_leader); read_lock(&tasklist_lock); task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is a normal spinlock, then this is an A/B B/A deadlock. > That code was already dubious in the first place just because it contained that circularity. If you had a rwlock that blocks on an upper read count maximum, a deadlock situation would trigger anyway, say, upon a flood of threads trying to do that sequence of acquires. The RT patch uses the lock ordering in place, and whatever nasty situation was going on previously will effectively be under high load, which increases the chance of it being triggered. Removal of the read side semantic just increases load more, so that those cases can trigger. I disagree with this approach and I have an alternate implementation here that restores it. It's only half tested and fairly meaningless until an extreme contention case is revealed with the current rt lock implementation. Numbers need to be gathered to prove or disprove this conjecture. > I'd probably experiment with using the {spin,read,write}-trylock logic and release all the locks taken in such a sequence on the failure to acquire any of the locks in the chain, as an initial fix. A longer term fix might be to break things up a bit so that whatever ordering being done would have that circularity. Excuse me, ...would *not* have that circularity. bill
Re: Real-Time Preemption and RCU
On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote: > On Fri, 18 Mar 2005, Ingo Molnar wrote: > > * Esben Nielsen <[EMAIL PROTECTED]> wrote: > > > Why should there only be one RCU-reader per CPU at each given instance? Even on a real-time UP system it would be very helpful to have RCU areas be enterable by several tasks at once. It would perform better, both wrt. latencies and throughput: With the above implementation a high priority task entering an RCU area will have to boost the current RCU reader, make a task switch until that one finishes, and then make yet another task switch to get back to the high priority task. With an RCU implementation which can take n RCU readers per CPU there is no such problem. > > correct, for RCU we could allow multiple readers per lock, because the 'blocking' side of RCU (callback processing) is never (supposed to be) in any latency path. > > except if someone wants to make RCU callback processing deterministic at some point. (e.g. for memory management reasons.) > I think it can be deterministic (on the long timescale of memory management) anyway: Boost any non-RT task entering an RCU region to the lowest RT priority. This way only all the RT tasks + one non-RT task can be within those regions. The RT-tasks are supposed to have some kind of upper bound to their CPU-usage. The non-RT task will also finish "soon" as it is boosted. If the RCU batches are also at the lowest RT-priority they can be run immediately after the non-RT task is done. Hmmm... Sort of a preemptive-first-strike priority boost. Cute! ;-) > > clearly the simplest solution is to go with the single-reader locks for now - a separate experiment could be done with a new type of rwlock that can only be used by the RCU code. (I'm not quite sure whether we could guarantee a minimum rate of RCU callback processing under such a scheme though. It's an eventual memory DoS otherwise.) > Why is a lock needed at all? 
> If it is doable without locking for a non-preemptable SMP kernel, it must be doable for a preemptable kernel as well. I am convinced some kind of per-CPU rcu_read_count as I specified in my previous mail can work some way or the other. call_rcu() might need to do more complicated stuff and thus use CPU, but call_rcu() is supposed to be a relatively rare event not worth optimizing for. Such an implementation will work for any preemptable kernel, not only PREEMPT_RT. Where performance is concerned, it is important not to acquire any locks in the rcu-read regions. You definitely don't need a lock -- you can just suppress preemption on the read side instead. But that potentially makes for long scheduling latencies. The counter approach might work, and is also what the implementation #5 does -- check out rcu_read_lock() in Ingo's most recent patch. Thanx, Paul > I tried this approach. My UP laptop did boot on it, but I haven't tested it further. I have included the very small patch as an attachment. > > Ingo > I have not yet looked at -V0.7.41-00... > Esben diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h 2005-03-11 23:40:13.0 +0100 +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h 2005-03-19 12:47:09.0 +0100 @@ -85,6 +85,7 @@ * curlist - current batch for which quiescent cycle started if any */ struct rcu_data { + long active_readers; /* 1) quiescent state handling : */ long quiescbatch; /* Batch # for grace period */ int passed_quiesc; /* User-mode/idle loop etc. 
*/ @@ -115,12 +116,14 @@ static inline void rcu_qsctr_inc(int cpu) { struct rcu_data *rdp = &per_cpu(rcu_data, cpu); - rdp->passed_quiesc = 1; + if(rdp->active_readers==0) + rdp->passed_quiesc = 1; } static inline void rcu_bh_qsctr_inc(int cpu) { struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu); - rdp->passed_quiesc = 1; + if(rdp->active_readers==0) + rdp->passed_quiesc = 1; } static inline int __rcu_pending(struct rcu_ctrlblk *rcp, @@ -183,29 +186,27 @@ * * It is illegal to block while in an RCU read-side critical section. */ -#define rcu_read_lock() preempt_disable() +static inline void rcu_read_lock(void) +{ + preempt_disable(); + __get_cpu_var(rcu_data).active_readers++; + preempt_enable(); +} /** * rcu_read_unlock - marks the end of an RCU read-side critical section. * * See rcu_read_lock() for more information. */ -#define rcu_read_unlock() preempt_enable() +static inline void rcu_read_unlock(void) +{ + preempt_disable(); + __get_cpu_var(rcu_data).active_readers--; +
Re: Real-Time Preemption and RCU
On Sun, 20 Mar 2005, Paul E. McKenney wrote: > On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote: > > On Fri, 18 Mar 2005, Ingo Molnar wrote: [...] > > I think it can be deterministic (on the long timescale of memory management) anyway: Boost any non-RT task entering an RCU region to the lowest RT priority. This way only all the RT tasks + one non-RT task can be within those regions. The RT-tasks are supposed to have some kind of upper bound to their CPU-usage. The non-RT task will also finish "soon" as it is boosted. If the RCU batches are also at the lowest RT-priority they can be run immediately after the non-RT task is done. > Hmmm... Sort of a preemptive-first-strike priority boost. Cute! ;-) Well, I was actually thinking of an API like preempt_by_nonrt_disable() preempt_by_nonrt_enable() working like the old preempt_disable()/preempt_enable() but still allowing RT tasks (as well as priority inheriting non-RT tasks) to be scheduled. That would kind of help split the kernel into two halves: the RT part and the non-RT part. The non-RT part would in many ways work as it has always done. > > clearly the simplest solution is to go with the single-reader locks for now - a separate experiment could be done with a new type of rwlock that can only be used by the RCU code. (I'm not quite sure whether we could guarantee a minimum rate of RCU callback processing under such a scheme though. It's an eventual memory DoS otherwise.) > > Why is a lock needed at all? If it is doable without locking for a non-preemptable SMP kernel, it must be doable for a preemptable kernel as well. I am convinced some kind of per-CPU rcu_read_count as I specified in my previous mail can work some way or the other. call_rcu() might need to do more complicated stuff and thus use CPU, but call_rcu() is supposed to be a relatively rare event not worth optimizing for. Such an implementation will work for any preemptable kernel, not only PREEMPT_RT. 
> > Where performance is concerned, it is important not to acquire any locks in the rcu-read regions. > You definitely don't need a lock -- you can just suppress preemption on the read side instead. But that potentially makes for long scheduling latencies. Well, in my patch I do not leave preemption off - only while doing the simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader might be active, i.e. preempted, on the current CPU, such that this isn't a quiescent point after all. (To others: Paul nicely unfolded my attachment below - I left it in the mail so that you can read it.) The problem with this approach is of course that user space programs might preempt an RCU reader for a very long time such that RCU batches are never really run. The boosting of non-RT tasks mentioned above would help a lot. A plus(?) in it: You can actually sleep while having the rcu_read_lock !! > The counter approach might work, and is also what the implementation #5 does -- check out rcu_read_lock() in Ingo's most recent patch. Do you refer to your original mail with implementing it in 5 steps? In #5 in that one (-V0.7.41-00, right?) you use a lock, and as you say that forces synchronization between the CPUs - bad for scaling. It does make the RCU batches somewhat deterministic, as the RCU task can boost the readers to the rcu-task's priority. The problem about this approach is that everybody calling into RCU code has a worst case behaviour of the systemwide worst case RCU reader section - which can be pretty large (in principle infinite if somebody.) So if somebody uses a call to a function in the code containing an RCU read area, the worst case behaviour would be the same as the worst case latency in the simple world where preempt_disable()/preempt_enable() was used. > Thanx, Paul > I tried this approach. My UP laptop did boot on it, but I haven't tested it further. I have included the very small patch as an attachment. > > Ingo > I have not yet looked at -V0.7.41-00... 
> Esben diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h 2005-03-11 23:40:13.0 +0100 +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h 2005-03-19 12:47:09.0 +0100 @@ -85,6 +85,7 @@ * curlist - current batch for which quiescent cycle started if any */ struct rcu_data { + long active_readers; /* 1) quiescent state handling : */ long quiescbatch; /* Batch # for grace period */ int passed_quiesc; /* User-mode/idle loop etc. */ @@ -115,12 +116,14 @@ static inline void rcu_qsctr_inc(int cpu) { struct rcu_data *rdp = &per_cpu(rcu_data, cpu); - rdp->passed_quiesc = 1; +
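Esben's concern — a preempted reader holds up the grace period, so the batches never run until the reader is boosted and finishes — extends the per-CPU counter model across CPUs. A hypothetical multi-CPU model (arrays stand in for per-CPU data; the scheduler and boosting are reduced to counter updates):

```c
#include <assert.h>

#define NR_CPUS 2

static long active_readers[NR_CPUS];
static int passed_quiesc[NR_CPUS];

/* a context switch on 'cpu' counts as a quiescent state only if no
 * (possibly preempted) reader section is outstanding there */
static void quiescent_point(int cpu)
{
	if (active_readers[cpu] == 0)
		passed_quiesc[cpu] = 1;
}

/* the grace period -- and with it the RCU batch -- may complete only
 * once every cpu has reported a quiescent state */
static int grace_period_done(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (!passed_quiesc[cpu])
			return 0;
	return 1;
}
```

In this model, a reader preempted indefinitely by user-space programs defers the batch forever, which is exactly the gap the lowest-RT-priority boost is meant to close: the boosted reader runs, exits the region, and the next quiescent point completes the grace period.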
Re: Real-Time Preemption and RCU
Ingo Molnar wrote: > which precise locking situation do you mean? cpu 1: acquire random networking spin_lock_bh() cpu 2: read_lock(&tasklist_lock) from process context interrupt. softirq. within softirq: try to acquire the networking lock. * spins. cpu 1: hardware interrupt within hw interrupt: signal delivery. tries to acquire tasklist_lock. --> deadlock. -- Manfred
Re: Real-Time Preemption and RCU
* Herbert Xu <[EMAIL PROTECTED]> wrote: > Ingo Molnar <[EMAIL PROTECTED]> wrote: > > > > i really have no intention to allow multiple readers for rt-mutexes. We > > got away with that so far, and i'd like to keep it so. Imagine 100 > > threads all blocked in the same critical section (holding the read-lock) > > when a highprio writer thread comes around: instant 100x latency to let > > all of them roll forward. The only sane solution is to not allow > > excessive concurrency. (That limits SMP scalability, but there's no > > other choice i can see.) > > What about allowing only as many concurrent readers as there are CPUs? since a reader may be preempted by a higher prio task, there is no linear relationship between CPU utilization and the number of readers allowed. You could easily end up having all the nr_cpus readers preempted on one CPU. It gets pretty messy. Ingo
Re: Real-Time Preemption and RCU
* Manfred Spraul <[EMAIL PROTECTED]> wrote: > Ingo Molnar wrote: > > read_lock(rwlock); > > ... > > read_lock(rwlock); > > > >are still legal. (it's also done quite often.) > > How do you handle the write_lock_irq()/read_lock locks? E.g. the > tasklist_lock or the fasync_lock. which precise locking situation do you mean? Ingo
Re: Real-Time Preemption and RCU
Ingo Molnar wrote: > read_lock(rwlock); > ... > read_lock(rwlock); > > are still legal. (it's also done quite often.) How do you handle the write_lock_irq()/read_lock locks? E.g. the tasklist_lock or the fasync_lock. -- Manfred
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 02:22:30PM -0800, Paul E. McKenney wrote: > On Fri, Mar 18, 2005 at 12:35:17PM -0800, Paul E. McKenney wrote: > > Compiles, probably dies horribly. "diff" didn't do such a good job > > on this one, so attaching the raw rcupdate.[hc] files as well. > > My prediction was all too accurate. ;-) > > The attached patch at least boots on a 1-CPU x86 box. I added some > interrupt disabling that is a bad idea in real-time preempt kernels, > but necessary for stock kernels to even have a ghost of a chance. > > Again, the diff is quite confusing to read (for me, anyway!), so attached > the rcupdate.[hc] files. > > Assuming this patch survives the LTP run (hah!!!), next step is a small > SMP system. It did survive the UP LTP run; here is an updated patch with a fix needed for SMP. I am having trouble with my test system (independent of this patch); hopefully I will get more testing in over the weekend. Thanx, Paul Signed-off-by: <[EMAIL PROTECTED]>

diff -urpN -X ../dontdiff linux-2.5/include/linux/rcupdate.h linux-2.5-rtRCU/include/linux/rcupdate.h
--- linux-2.5/include/linux/rcupdate.h	Wed Mar  9 12:37:06 2005
+++ linux-2.5-rtRCU/include/linux/rcupdate.h	Fri Mar 18 11:37:02 2005
@@ -58,169 +58,11 @@ struct rcu_head {
 	(ptr)->next = NULL; (ptr)->func = NULL; \
 } while (0)
-
-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
-	long	cur;		/* Current batch number.              */
-	long	completed;	/* Number of the last completed batch */
-	int	next_pending;	/* Is the next batch already waiting? */
-} ____cacheline_maxaligned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
-	return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
-	return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
-	/* 1) quiescent state handling : */
-	long	quiescbatch;	/* Batch # for grace period */
-	int	passed_quiesc;	/* User-mode/idle loop etc. */
-	int	qs_pending;	/* core waits for quiesc state */
-
-	/* 2) batch handling */
-	long	batch;		/* Batch # for current RCU batch */
-	struct rcu_head *nxtlist;
-	struct rcu_head **nxttail;
-	struct rcu_head *curlist;
-	struct rcu_head **curtail;
-	struct rcu_head *donelist;
-	struct rcu_head **donetail;
-	int	cpu;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-extern struct rcu_ctrlblk rcu_ctrlblk;
-extern struct rcu_ctrlblk rcu_bh_ctrlblk;
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-
-static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
-					struct rcu_data *rdp)
-{
-	/* This cpu has pending rcu entries and the grace period
-	 * for them has completed.
-	 */
-	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
-		return 1;
-
-	/* This cpu has no pending entries, but there are new entries */
-	if (!rdp->curlist && rdp->nxtlist)
-		return 1;
-
-	/* This cpu has finished callbacks to invoke */
-	if (rdp->donelist)
-		return 1;
-
-	/* The rcu core waits for a quiescent state from the cpu */
-	if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
-		return 1;
-
-	/* nothing to do */
-	return 0;
-}
-
-static inline int rcu_pending(int cpu)
-{
-	return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
-		__rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
-}
-
-/**
- * rcu_read_lock - mark the beginning of an RCU read-side critical section.
- *
- * When synchronize_kernel() is invoked on one CPU while other CPUs
- * are within RCU read-side critical sections, then the
- * synchronize_kernel() is guaranteed to block until after all the other
- * CPUs exit their critical sections. Similarly, if call_rcu() is invoked
- * on one CPU while other CPUs are within RCU read-side critical
- * sections, invocation of the corresponding RCU callback is deferred
- * until after the all the other CPUs exit their critical sections.
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 12:35:17PM -0800, Paul E. McKenney wrote: > Compiles, probably dies horribly. "diff" didn't do such a good job > on this one, so attaching the raw rcupdate.[hc] files as well. My prediction was all too accurate. ;-) The attached patch at least boots on a 1-CPU x86 box. I added some interrupt disabling that is a bad idea in real-time preempt kernels, but necessary for stock kernels to even have a ghost of a chance. Again, the diff is quite confusing to read (for me, anyway!), so attached the rcupdate.[hc] files. Assuming this patch survives the LTP run (hah!!!), next step is a small SMP system. Thanx, Paul Signed-off-by: <[EMAIL PROTECTED]>

diff -urpN -X ../dontdiff linux-2.5/include/linux/rcupdate.h linux-2.5-rtRCU/include/linux/rcupdate.h
--- linux-2.5/include/linux/rcupdate.h	Wed Mar  9 12:37:06 2005
+++ linux-2.5-rtRCU/include/linux/rcupdate.h	Fri Mar 18 11:37:02 2005
@@ -58,169 +58,11 @@ struct rcu_head {
 	(ptr)->next = NULL; (ptr)->func = NULL; \
 } while (0)
-
-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
-	long	cur;		/* Current batch number.              */
-	long	completed;	/* Number of the last completed batch */
-	int	next_pending;	/* Is the next batch already waiting? */
-} ____cacheline_maxaligned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
-	return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
-	return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
-	/* 1) quiescent state handling : */
-	long	quiescbatch;	/* Batch # for grace period */
-	int	passed_quiesc;	/* User-mode/idle loop etc. */
-	int	qs_pending;	/* core waits for quiesc state */
-
-	/* 2) batch handling */
-	long	batch;		/* Batch # for current RCU batch */
-	struct rcu_head *nxtlist;
-	struct rcu_head **nxttail;
-	struct rcu_head *curlist;
-	struct rcu_head **curtail;
-	struct rcu_head *donelist;
-	struct rcu_head **donetail;
-	int	cpu;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-extern struct rcu_ctrlblk rcu_ctrlblk;
-extern struct rcu_ctrlblk rcu_bh_ctrlblk;
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-
-static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
-					struct rcu_data *rdp)
-{
-	/* This cpu has pending rcu entries and the grace period
-	 * for them has completed.
-	 */
-	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
-		return 1;
-
-	/* This cpu has no pending entries, but there are new entries */
-	if (!rdp->curlist && rdp->nxtlist)
-		return 1;
-
-	/* This cpu has finished callbacks to invoke */
-	if (rdp->donelist)
-		return 1;
-
-	/* The rcu core waits for a quiescent state from the cpu */
-	if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
-		return 1;
-
-	/* nothing to do */
-	return 0;
-}
-
-static inline int rcu_pending(int cpu)
-{
-	return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
-		__rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
-}
-
-/**
- * rcu_read_lock - mark the beginning of an RCU read-side critical section.
- *
- * When synchronize_kernel() is invoked on one CPU while other CPUs
- * are within RCU read-side critical sections, then the
- * synchronize_kernel() is guaranteed to block until after all the other
- * CPUs exit their critical sections. Similarly, if call_rcu() is invoked
- * on one CPU while other CPUs are within RCU read-side critical
- * sections, invocation of the corresponding RCU callback is deferred
- * until after the all the other CPUs exit their critical sections.
- *
- * Note, however, that RCU callbacks are permitted to run concurrently
- * with RCU read-side critical sections. One way that this can happen
- * is via the following sequence of events: (1) CPU 0 enters an RCU
- * read-side critical section, (2) CPU 1 invokes
Re: Real-Time Preemption and RCU
Ingo Molnar <[EMAIL PROTECTED]> wrote: > > i really have no intention to allow multiple readers for rt-mutexes. We > got away with that so far, and i'd like to keep it so. Imagine 100 > threads all blocked in the same critical section (holding the read-lock) > when a highprio writer thread comes around: instant 100x latency to let > all of them roll forward. The only sane solution is to not allow > excessive concurrency. (That limits SMP scalability, but there's no > other choice i can see.) What about allowing only as many concurrent readers as there are CPUs? -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 06:11:26PM +0100, Ingo Molnar wrote: > > * Paul E. McKenney <[EMAIL PROTECTED]> wrote: > > > For the patch, here are my questions: > > > > o What is the best way to select between classic RCU and this > > scheme? > > > > 1. Massive #ifdef across rcupdate.c > > > > 2. Create an rcupdate_rt.c and browbeat the build system > > into picking one or the other (no clue if this is > > possible...) > > > > 3. Create an rcupdate_rt.c and rely on the linker to pick > > one or the other, with rcupdate.h generating different > > external symbol names to make the choice. > > you can also go for option #0: just replace the existing RCU code with > the new one, and i'll then deal with the configuration details. > > what will have to happen is most likely #2 (since there is near zero > code sharing between the two variants, right?). Picking rcupdate_rt.c is > as simple as doing this: > > obj-$(CONFIG_DONT_PREEMPT_RCU) += rcupdate.o > obj-$(CONFIG_PREEMPT_RCU) += rcupdate_rt.o > > and then use Kconfig to generate either CONFIG_DONT_PREEMPT_RCU > (default) or CONFIG_PREEMPT_RCU (if the user selects it). > > but it's not yet clear whether we want to offer this to users as a > configurable option. The simplest solution for you would be to go with > option #0 :-) [or if you prefer switchability, #1 is good too - i can > then extract the bits and do #2 based on that.] > > > o How best to interface to OOM? Left to myself, I leave this > > for later. ;-) > > yeah, i'd not worry about OOM that much at this stage. > > > I will take the cowardly approach of patching against the upstream > > kernel. > > sure. This is in fact easier for me: i'll first rip all my RCU hackery > out of -RT and then add your code, so the base i'll be merging against > will be closer to upstream than to current -RT. Compiles, probably dies horribly. "diff" didn't do such a good job on this one, so attaching the raw rcupdate.[hc] files as well. 
Thanx, Paul Signed-off-by: <[EMAIL PROTECTED]>

diff -urpN -X ../dontdiff linux-2.5/include/linux/rcupdate.h linux-2.5-rtRCU/include/linux/rcupdate.h
--- linux-2.5/include/linux/rcupdate.h	Wed Mar  9 12:37:06 2005
+++ linux-2.5-rtRCU/include/linux/rcupdate.h	Fri Mar 18 11:37:02 2005
@@ -58,169 +58,11 @@ struct rcu_head {
 	(ptr)->next = NULL; (ptr)->func = NULL; \
 } while (0)
-
-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
-	long	cur;		/* Current batch number.              */
-	long	completed;	/* Number of the last completed batch */
-	int	next_pending;	/* Is the next batch already waiting? */
-} ____cacheline_maxaligned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
-	return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
-	return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
-	/* 1) quiescent state handling : */
-	long	quiescbatch;	/* Batch # for grace period */
-	int	passed_quiesc;	/* User-mode/idle loop etc. */
-	int	qs_pending;	/* core waits for quiesc state */
-
-	/* 2) batch handling */
-	long	batch;		/* Batch # for current RCU batch */
-	struct rcu_head *nxtlist;
-	struct rcu_head **nxttail;
-	struct rcu_head *curlist;
-	struct rcu_head **curtail;
-	struct rcu_head *donelist;
-	struct rcu_head **donetail;
-	int	cpu;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-extern struct rcu_ctrlblk rcu_ctrlblk;
-extern struct rcu_ctrlblk rcu_bh_ctrlblk;
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-
-static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
-					struct rcu_data *rdp)
-{
-	/* This cpu has pending rcu entries and the grace period
-	 * for them has completed.
-	 */
-	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
-		return 1;
-
-	/* This cpu has no pending entries, but there are new entries */
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 06:11:26PM +0100, Ingo Molnar wrote: > > * Paul E. McKenney <[EMAIL PROTECTED]> wrote: > > > For the patch, here are my questions: > > > > o What is the best way to select between classic RCU and this > > scheme? > > > > 1. Massive #ifdef across rcupdate.c > > > > 2. Create an rcupdate_rt.c and browbeat the build system > > into picking one or the other (no clue if this is > > possible...) > > > > 3. Create an rcupdate_rt.c and rely on the linker to pick > > one or the other, with rcupdate.h generating different > > external symbol names to make the choice. > > you can also go for option #0: just replace the existing RCU code with > the new one, and i'll then deal with the configuration details. Having just spent the past few minutes descending into #ifdef hell, I agree completely with your option #0. > what will have to happen is most likely #2 (since there is near zero > code sharing between the two variants, right?). Picking rcupdate_rt.c is > as simple as doing this: > > obj-$(CONFIG_DONT_PREEMPT_RCU) += rcupdate.o > obj-$(CONFIG_PREEMPT_RCU) += rcupdate_rt.o > > and then use Kconfig to generate either CONFIG_DONT_PREEMPT_RCU > (default) or CONFIG_PREEMPT_RCU (if the user selects it). Cool! Thank you for the tutorial! And yes, there is no shared code that I can see. > but it's not yet clear whether we want to offer this to users as a > configurable option. The simplest solution for you would be to go with > option #0 :-) [or if you prefer switchability, #1 is good too - i can > then extract the bits and do #2 based on that.] Option #0 it is -- I will stick with the locking algorithms and let wiser and more experienced heads work out the configuration issues. > > o How best to interface to OOM? Left to myself, I leave this > > for later. ;-) > > yeah, i'd not worry about OOM that much at this stage. > > > I will take the cowardly approach of patching against the upstream > > kernel. > > sure. 
This is in fact easier for me: i'll first rip all my RCU hackery > out of -RT and then add your code, so the base i'll be merging against > will be closer to upstream than to current -RT. Sounds good to me! Thanx, Paul
Re: Real-Time Preemption and RCU
* Esben Nielsen <[EMAIL PROTECTED]> wrote: > Why should there only be one RCU reader per CPU at any given > instance? Even on a real-time UP system it would be very helpful to > have RCU areas be enterable by several tasks at once. It would > perform better, both wrt. latencies and throughput: With the above > implementation a high priority task entering an RCU area will have to > boost the current RCU reader, make a task switch until that one > finishes, and make yet another task switch to get back to the high > priority task. With an RCU implementation which can take n RCU readers > per CPU there is no such problem. correct, for RCU we could allow multiple readers per lock, because the 'blocking' side of RCU (callback processing) is never (supposed to be) in any latency path. except if someone wants to make RCU callback processing deterministic at some point. (e.g. for memory management reasons.) clearly the simplest solution is to go with the single-reader locks for now - a separate experiment could be done with a new type of rwlock that can only be used by the RCU code. (I'm not quite sure whether we could guarantee a minimum rate of RCU callback processing under such a scheme though. It's an eventual memory DoS otherwise.) Ingo
Re: Real-Time Preemption and RCU
* Paul E. McKenney <[EMAIL PROTECTED]> wrote: > For the patch, here are my questions: > > o What is the best way to select between classic RCU and this > scheme? > > 1. Massive #ifdef across rcupdate.c > > 2. Create an rcupdate_rt.c and browbeat the build system > into picking one or the other (no clue if this is > possible...) > > 3. Create an rcupdate_rt.c and rely on the linker to pick > one or the other, with rcupdate.h generating different > external symbol names to make the choice. you can also go for option #0: just replace the existing RCU code with the new one, and i'll then deal with the configuration details. what will have to happen is most likely #2 (since there is near zero code sharing between the two variants, right?). Picking rcupdate_rt.c is as simple as doing this: obj-$(CONFIG_DONT_PREEMPT_RCU) += rcupdate.o obj-$(CONFIG_PREEMPT_RCU) += rcupdate_rt.o and then use Kconfig to generate either CONFIG_DONT_PREEMPT_RCU (default) or CONFIG_PREEMPT_RCU (if the user selects it). but it's not yet clear whether we want to offer this to users as a configurable option. The simplest solution for you would be to go with option #0 :-) [or if you prefer switchability, #1 is good too - i can then extract the bits and do #2 based on that.] > o How best to interface to OOM? Left to myself, I leave this > for later. ;-) yeah, i'd not worry about OOM that much at this stage. > I will take the cowardly approach of patching against the upstream > kernel. sure. This is in fact easier for me: i'll first rip all my RCU hackery out of -RT and then add your code, so the base i'll be merging against will be closer to upstream than to current -RT. Ingo
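The Kconfig side of Ingo's two-line Makefile sketch might look roughly like the following. This is a hypothetical fragment: neither CONFIG symbol exists in any tree at this point, and the real wiring is explicitly left to Ingo.

```kconfig
# Hypothetical Kconfig fragment (illustrative only)
config PREEMPT_RCU
	bool "Preemptible RCU"
	depends on PREEMPT
	help
	  Build the preemptible rcupdate_rt.c implementation instead
	  of classic RCU.

config DONT_PREEMPT_RCU
	bool
	default y if !PREEMPT_RCU
```

With that in place, the two obj-$(...) lines from the mail select exactly one of rcupdate.o and rcupdate_rt.o at build time.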
Re: Real-Time Preemption and RCU
On Fri, 18 Mar 2005, Ingo Molnar wrote: > > * Bill Huey <[EMAIL PROTECTED]> wrote: > > > I'd like to note another problem. Mingo's current implementation of > > rt_mutex (super mutex for all blocking synchronization) is still > > missing reader counts and something like that would have to be > > implemented if you want to do priority inheritance over blocks. > > i really have no intention to allow multiple readers for rt-mutexes. We > got away with that so far, and i'd like to keep it so. Imagine 100 > threads all blocked in the same critical section (holding the read-lock) > when a highprio writer thread comes around: instant 100x latency to let > all of them roll forward. The only sane solution is to not allow > excessive concurrency. (That limits SMP scalability, but there's no > other choice i can see.) > Unless a design change is made: One could argue for semantics where write-locking _isn't_ deterministic and thus does not have to boost all the readers. Readers boost the writers but not the other way around. Readers will be deterministic, but not writers. Such semantics would probably work for a lot of RT applications that happen not to take any write-locks - these will in fact perform better. But it will give the rest a lot of problems. > Ingo Esben
Re: Real-Time Preemption and RCU
On Fri, 18 Mar 2005, Ingo Molnar wrote: > > * Ingo Molnar <[EMAIL PROTECTED]> wrote: > > > [...] How about something like: > > > > void > > rcu_read_lock(void) > > { > > preempt_disable(); > > if (current->rcu_read_lock_nesting++ == 0) { > > current->rcu_read_lock_ptr = > > &__get_cpu_var(rcu_data).lock; > > preempt_enable(); > > read_lock(current->rcu_read_lock_ptr); > > } else > > preempt_enable(); > > } > > > > this would still make it 'statistically scalable' - but is it correct? > > thinking some more about it, i believe it's correct, because it picks > one particular CPU's lock and correctly releases that lock. > > (read_unlock() is atomic even on PREEMPT_RT, so rcu_read_unlock() is > fine.) > Why should there only be one RCU reader per CPU at any given instance? Even on a real-time UP system it would be very helpful to have RCU areas be enterable by several tasks at once. It would perform better, both wrt. latencies and throughput: With the above implementation a high priority task entering an RCU area will have to boost the current RCU reader, make a task switch until that one finishes, and then make yet another task switch to get back to the high priority task. With an RCU implementation which can take n RCU readers per CPU there is no such problem. Also, having all tasks serialize on one lock (per CPU) really destroys the real-time properties: The latency of anything which uses RCU will be the worst latency of anything done under the RCU lock. When I looked briefly at it in the fall the following solution jumped to mind: Have an RCU-reader count, rcu_read_count, for each CPU. When you enter an RCU read region increment it, and decrement it when you go out of it. When it is 0, RCU cleanups are allowed - a perfect quiescent state. Call rcu_qsctr_inc() at that point. Or call it in schedule() as now, just with an if (rcu_read_count == 0) around it. I don't think I understand the current code. But if it works now with preempt_disable()/preempt_enable() around all the read-regions, it ought to work with preempt_enable(); rcu_read_count++/--; preempt_disable() around the same regions and the above check for rcu_read_count == 0 in or around rcu_qsctr_inc() as well. It might take a long time before the rcu-batches are actually called, though, but that is a different story, which can be improved upon. An improvement would be to boost the non-RT tasks entering an rcu-read region to the lowest RT priority. That way there can't be a lot of low priority tasks hanging around keeping rcu_read_count non-zero for a long period of time, since these tasks can only be preempted by RT tasks while in the RCU region. > Ingo Esben
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 08:49:03AM +0100, Ingo Molnar wrote: > > * Paul E. McKenney <[EMAIL PROTECTED]> wrote: > > > Seems to me that it would be good to have an RCU implementation that > > meet the requirements of the Real-Time Preemption patch, but that is > > 100% compatible with the "classic RCU" API. Such an implementation > > must meet a number of requirements, which are listed at the end of > > this message (search for "REQUIREMENTS"). > > [ Wow. you must be a secret telepathic mind-reader - yesterday i was > thinking about mailing you, because my approach to RCU preemptability > (the API variants) clearly sucked and caused problems all around, both > in terms of maintainability and in terms of stability and > scalability. ] Wish I could claim that, but must credit dumb luck. ;-) > > 5. The final version, which both scales and meets realtime > > requirements, as well as exactly matching the "classic RCU" > > API. > > > > I have tested this approach, but in user-level scaffolding. All of > > these implementations should therefore be regarded with great > > suspicion: untested, probably don't even compile. Besides which, I > > certainly can't claim to fully understand the real-time preempt patch, > > so I am bound to have gotten something wrong somewhere. In any case, > > none of these implementations are a suitable replacement for "classic > > RCU" on large servers, since they acquire locks in the RCU read-side > > critical sections. However, they should scale enough to support small > > SMP systems, inflicting only a modest performance penalty. > > basically for PREEMPT_RT the only constraint is that RCU sections should > be preemptable. Whatever the performance cost. If PREEMPT_RT is merged > into the upstream kernel then it will (at least initially) be at a > status similar to NOMMU: it will be tolerated as long as it causes no > 'drag' on the main code. The RCU API variants i introduced clearly > violated this requirement, and were my #1 worry wrt.
upstream > mergeability. Fair enough! > > I believe that implementation #5 is most appropriate for real-time > > preempt kernels. [...] > > yeah, agreed - it looks perfect - both the read and write side is > preemptable. Can i just plug the code you sent into rcupdate.c and > expect it to work, or would you like to send a patch? If you prefer you > can make it an unconditional patch against an upstream kernel to keep > things simple for you - i'll then massage it to be properly PREEMPT_RT > dependent. I will give it a shot, but feel free to grab anything at any point, starting either from the patch I will send in a bit or from my earlier email. Do whatever works best for you. For the patch, here are my questions: o What is the best way to select between classic RCU and this scheme? 1. Massive #ifdef across rcupdate.c 2. Create an rcupdate_rt.c and browbeat the build system into picking one or the other (no clue if this is possible...) 3. Create an rcupdate_rt.c and rely on the linker to pick one or the other, with rcupdate.h generating different external symbol names to make the choice. Left to myself, I would grit my teeth and do #1, since I understand how to do it. Flames welcome, as long as accompanied by good advice! o How best to interface to OOM? Left to myself, I leave this for later. ;-) I will take the cowardly approach of patching against the upstream kernel. > > [...] In theory, #3 might be appropriate, but if I understand the > > real-time preempt implementation of reader-writer lock, it will not > > perform well if there are long RCU read-side critical sections, even > > in UP kernels. > > all RCU-locked sections must be preemptable in -RT. Basically RCU is a > mainstream API that is used by lots of code and will be introduced in > many other areas as well. From the -RT kernel's POV this is an > 'uncontrollable latency source', which keeps introducing critical > sections.
One major goal of PREEMPT_RT > is to convert all popular critical section APIs into preemptible > sections, so that the amount of code that is non-preemptable is > drastically reduced and can be managed (and thus can be trusted). This > goal has a higher priority than any performance consideration, because > it doesn't matter what performance you have, if you cannot trust the > kernel to be deterministic. Makes sense to me! Thanx, Paul
Re: Real-Time Preemption and RCU
* Bill Huey <[EMAIL PROTECTED]> wrote: > I'd like to note another problem. Mingo's current implementation of > rt_mutex (super mutex for all blocking synchronization) is still > missing reader counts and something like that would have to be > implemented if you want to do priority inheritance over blocks. i really have no intention to allow multiple readers for rt-mutexes. We got away with that so far, and i'd like to keep it so. Imagine 100 threads all blocked in the same critical section (holding the read-lock) when a highprio writer thread comes around: instant 100x latency to let all of them roll forward. The only sane solution is to not allow excessive concurrency. (That limits SMP scalability, but there's no other choice i can see.) Ingo
Re: Real-Time Preemption and RCU
* Paul E. McKenney <[EMAIL PROTECTED]> wrote: > > > preempt_disable(); > > > if (current->rcu_read_lock_nesting++ == 0) { > > > current->rcu_read_lock_ptr = > > > &__get_cpu_var(rcu_data).lock; > > > read_lock(current->rcu_read_lock_ptr); > > > } > > > preempt_enable(); > My current thought is that the preempt_disable()/preempt_enable() can > be dropped entirely. Messes up any tool that browses through > ->rcu_read_lock_nesting, but don't see any other problem. Yet, > anyway! yeah - this sounds good. (We are not aiming for irq-safe RCU anyway, on PREEMPT_RT.) Ingo
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 05:17:29AM -0800, Bill Huey wrote: > On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote: > > On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote: > > > 5. Scalability -and- Realtime Response. > > ... > > > > > void > > > rcu_read_lock(void) > > > { > > > preempt_disable(); > > > if (current->rcu_read_lock_nesting++ == 0) { > > > current->rcu_read_lock_ptr = > > > &__get_cpu_var(rcu_data).lock; > > > read_lock(current->rcu_read_lock_ptr); > > > } > > > preempt_enable(); > > > } > > > > Ok, here's a rather unsure question... > > > > Uh, is that a sleep violation if that is exclusively held since it > > can block within an atomic critical section (deadlock)? > > I'd like to note another problem. Mingo's current implementation of rt_mutex > (super mutex for all blocking synchronization) is still missing reader counts > and something like that would have to be implemented if you want to do > priority inheritance over blocks. > > This is going to throw a wrench into your implementation if you assume that. If we need to do priority inheritance across the memory allocator, so that high-priority tasks blocked waiting for memory pass their priority on to RCU readers, agreed. But I don't see any sign that real-time preempt does this. In the absence of this, as Ingo noted, the fact that readers don't block each other should make things be safe. I think... Thanx, Paul
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote: > On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote: > > 5. Scalability -and- Realtime Response. > ... > > > void > > rcu_read_lock(void) > > { > > preempt_disable(); > > if (current->rcu_read_lock_nesting++ == 0) { > > current->rcu_read_lock_ptr = > > &__get_cpu_var(rcu_data).lock; > > read_lock(current->rcu_read_lock_ptr); > > } > > preempt_enable(); > > } > > Ok, here's a rather unsure question... > > Uh, is that a sleep violation if that is exclusively held since it > can block within an atomic critical section (deadlock) ? Hey, I wasn't joking when I said that I probably got something wrong! ;-) My current thought is that the preempt_disable()/preempt_enable() can be dropped entirely. Messes up any tool that browses through ->rcu_read_lock_nesting, but don't see any other problem. Yet, anyway! Thanx, Paul
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 11:03:39AM +0100, Ingo Molnar wrote:
>
> there's a problem in #5's rcu_read_lock():
>
>	void
>	rcu_read_lock(void)
>	{
>		preempt_disable();
>		if (current->rcu_read_lock_nesting++ == 0) {
>			current->rcu_read_lock_ptr =
>				&__get_cpu_var(rcu_data).lock;
>			read_lock(current->rcu_read_lock_ptr);
>		}
>		preempt_enable();
>	}
>
> not only are read_lock()-ed sections preemptible, read_lock() itself may
> block, so it cannot be called from within preempt_disable(). How about
> something like:
>
>	void
>	rcu_read_lock(void)
>	{
>		preempt_disable();
>		if (current->rcu_read_lock_nesting++ == 0) {
>			current->rcu_read_lock_ptr =
>				&__get_cpu_var(rcu_data).lock;
>			preempt_enable();
>			read_lock(current->rcu_read_lock_ptr);
>		} else
>			preempt_enable();
>	}
>
> this would still make it 'statistically scalable' - but is it correct?

Good catch! Also good question...

Strictly speaking, it is not necessary to block callback invocation until
just after rcu_read_lock() returns. It is correct as long as there is no
sort of "upcall" or "callback" that can masquerade as this task. I know
of no such thing in the Linux kernel. In fact, such a thing would break a
lot of code, right?

Any tool that relied on the ->rcu_read_lock_nesting counter to deduce RCU
state would be confused by this change, but there might be other ways of
handling this. Also, we are currently making do without such a tool.

It should be possible to move the preempt_enable() further forward, ahead
of the assignment to ->rcu_read_lock_ptr, since that assignment is
strictly local. Not sure that this is worthwhile, thoughts?

	void
	rcu_read_lock(void)
	{
		preempt_disable();
		if (current->rcu_read_lock_nesting++ == 0) {
			preempt_enable();
			current->rcu_read_lock_ptr =
				&__get_cpu_var(rcu_data).lock;
			read_lock(current->rcu_read_lock_ptr);
		} else
			preempt_enable();
	}

The other question is whether preempt_disable() is needed in the first
place. The two task-structure fields are not accessed except by the task
itself. I bet that the following is just fine:

	void
	rcu_read_lock(void)
	{
		if (current->rcu_read_lock_nesting++ == 0) {
			current->rcu_read_lock_ptr =
				&__get_cpu_var(rcu_data).lock;
			read_lock(current->rcu_read_lock_ptr);
		}
	}

Thoughts?

						Thanx, Paul
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 10:53:27AM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> > there's one detail on PREEMPT_RT though (which i think you noticed too).
> >
> > Priority inheritance handling can be done in a pretty straightforward
> > way as long as no true read-side nesting is allowed for rwsems and
> > rwlocks - i.e. there's only one owner of a lock at a time. So
> > PREEMPT_RT restricts rwsem and rwlock concurrency: readers are
> > writers, with the only exception that they are allowed to 'self-nest'.
> > [...]
>
> this does not affect read-side RCU, because read-side RCU can never
> block a higher-prio thread. (as long as callback processing is pushed
> into a separate kernel thread.)
>
> so RCU will be pretty much the only mechanism (besides lock-free code)
> that allows reader concurrency on PREEMPT_RT.

This is a relief! I was wondering how on earth I was going to solve the
multi-task priority-inheritance problem!

But... How do we handle the following scenario?

0. A bunch of low-priority threads are preempted in the middle of
   various RCU read-side critical sections.

1. High-priority thread does kmalloc(), but there is no memory, so
   it blocks.

2. OOM handling notices, and decides to clean up the outstanding RCU
   callbacks. It therefore invokes _synchronize_kernel() (in
   implementation #5).

3. The _synchronize_kernel() function tries to acquire all of the
   read-side locks, which are held by numerous preempted low-priority
   threads.

4. What now???

Or does the current patch do priority inheritance across the memory
allocator? In other words, can we get away with ignoring this one? ;-)

						Thanx, Paul
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > 5. Scalability -and- Realtime Response.
> ...
> >	void
> >	rcu_read_lock(void)
> >	{
> >		preempt_disable();
> >		if (current->rcu_read_lock_nesting++ == 0) {
> >			current->rcu_read_lock_ptr =
> >				&__get_cpu_var(rcu_data).lock;
> >			read_lock(current->rcu_read_lock_ptr);
> >		}
> >		preempt_enable();
> >	}
>
> Ok, here's a rather unsure question...
>
> Uh, is that a sleep violation if that is exclusively held since it
> can block within an atomic critical section (deadlock) ?

I'd like to note another problem. Mingo's current implementation of
rt_mutex (the super mutex for all blocking synchronization) is still
missing reader counts, and something like that would have to be
implemented if you want to do priority inheritance over blocks.

This is going to throw a wrench into your implementation if you assume
that.

bill
Re: Real-Time Preemption and RCU
On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> 5. Scalability -and- Realtime Response.
...
>	void
>	rcu_read_lock(void)
>	{
>		preempt_disable();
>		if (current->rcu_read_lock_nesting++ == 0) {
>			current->rcu_read_lock_ptr =
>				&__get_cpu_var(rcu_data).lock;
>			read_lock(current->rcu_read_lock_ptr);
>		}
>		preempt_enable();
>	}

Ok, here's a rather unsure question...

Uh, is that a sleep violation if that lock is exclusively held, since
read_lock() can then block within an atomic critical section (deadlock)?

bill
Re: Real-Time Preemption and RCU
* Paul E. McKenney <[EMAIL PROTECTED]> wrote:

> 5. Scalability -and- Realtime Response.
>
> The trick here is to keep track of the CPU that did the
> rcu_read_lock() in the task structure. If there is a preemption,
> there will be some slight inefficiency, but the correct lock will be
> released, preserving correctness.

the inefficiency will be larger, because (as explained in a previous
mail) multiple concurrent owners of the read-lock are not allowed. This
adds to the overhead of PREEMPT_RT on SMP, but is an intentional
tradeoff. I dont expect PREEMPT_RT to be used on an Altix :-|

#5 is still the best option, because in the normal 'infrequent
preemption' case it will behave in a cache-friendly way.

A positive effect of the lock serializing is that the callback backlog
will stay relatively small and that the RCU backlog processing can be
made deterministic as well (if needed), by making the backlog processing
thread(s) SCHED_FIFO.

	Ingo
Re: Real-Time Preemption and RCU
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> [...] How about something like:
>
>	void
>	rcu_read_lock(void)
>	{
>		preempt_disable();
>		if (current->rcu_read_lock_nesting++ == 0) {
>			current->rcu_read_lock_ptr =
>				&__get_cpu_var(rcu_data).lock;
>			preempt_enable();
>			read_lock(current->rcu_read_lock_ptr);
>		} else
>			preempt_enable();
>	}
>
> this would still make it 'statistically scalable' - but is it correct?

thinking some more about it, i believe it's correct, because it picks
one particular CPU's lock and correctly releases that lock.
(read_unlock() is atomic even on PREEMPT_RT, so rcu_read_unlock() is
fine.)

	Ingo
Re: Real-Time Preemption and RCU
there's a problem in #5's rcu_read_lock():

	void
	rcu_read_lock(void)
	{
		preempt_disable();
		if (current->rcu_read_lock_nesting++ == 0) {
			current->rcu_read_lock_ptr =
				&__get_cpu_var(rcu_data).lock;
			read_lock(current->rcu_read_lock_ptr);
		}
		preempt_enable();
	}

not only are read_lock()-ed sections preemptible, read_lock() itself may
block, so it cannot be called from within preempt_disable(). How about
something like:

	void
	rcu_read_lock(void)
	{
		preempt_disable();
		if (current->rcu_read_lock_nesting++ == 0) {
			current->rcu_read_lock_ptr =
				&__get_cpu_var(rcu_data).lock;
			preempt_enable();
			read_lock(current->rcu_read_lock_ptr);
		} else
			preempt_enable();
	}

this would still make it 'statistically scalable' - but is it correct?

	Ingo
Re: Real-Time Preemption and RCU
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> there's one detail on PREEMPT_RT though (which i think you noticed too).
>
> Priority inheritance handling can be done in a pretty straightforward
> way as long as no true read-side nesting is allowed for rwsems and
> rwlocks - i.e. there's only one owner of a lock at a time. So
> PREEMPT_RT restricts rwsem and rwlock concurrency: readers are
> writers, with the only exception that they are allowed to 'self-nest'.
> [...]

this does not affect read-side RCU, because read-side RCU can never
block a higher-prio thread. (as long as callback processing is pushed
into a separate kernel thread.)

so RCU will be pretty much the only mechanism (besides lock-free code)
that allows reader concurrency on PREEMPT_RT.

	Ingo
Re: Real-Time Preemption and RCU
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> * Paul E. McKenney <[EMAIL PROTECTED]> wrote:
>
> > 4. Preemptible read side.
> >
> > RCU read-side critical sections can (in theory, anyway) be quite
> > large, degrading realtime scheduling response. Preemptible RCU
> > read-side critical sections avoid such degradation. Manual
> > insertion of "preemption points" might be an alternative as well.
> > But I have no intention of trying to settle the long-running
> > argument between proponents of preemption and of preemption
> > points. Not today, anyway! ;-)
>
> i'm cleverly sidestepping that argument by offering 4 preemption
> models in the -RT tree :-) We dont have to pick a winner, users will.
> [...]

also, it turned out that "preemption points" vs. "forced preemption" are
not exclusive concepts: PREEMPT_RT relies on _both_ of them.

when a highprio task tries to acquire a lock that another, lower-prio
task already holds, 'the time it takes for the lowprio task to drop the
lock' directly depends on the frequency of explicit preemption points
within the critical section. So to get good 'lock dependent' latencies
on PREEMPT_RT we also need a good distribution of preemption points
within locked sections.

(obviously, 'lock independent' preemption latencies depend purely on the
quality of forced preemption and the size of non-preempt critical
sections, not on any explicit preemption point.)

dont we love seemingly conflicting concepts that end up helping each
other? It's a nice flamewar-killer ;-)

	Ingo
Re: Real-Time Preemption and RCU
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> * Paul E. McKenney <[EMAIL PROTECTED]> wrote:
>
> > I have tested this approach, but in user-level scaffolding. All of
> > these implementations should therefore be regarded with great
> > suspicion: untested, probably don't even compile. Besides which, I
> > certainly can't claim to fully understand the real-time preempt patch,
> > so I am bound to have gotten something wrong somewhere. [...]
>
> you dont even have to consider the -RT patchset: if the scheme allows
> forced preemption of read-side RCU sections on current upstream
> CONFIG_PREEMPT, then it's perfect for PREEMPT_RT too.

there's one detail on PREEMPT_RT though (which i think you noticed too).

Priority inheritance handling can be done in a pretty straightforward
way as long as no true read-side nesting is allowed for rwsems and
rwlocks - i.e. there's only one owner of a lock at a time. So PREEMPT_RT
restricts rwsem and rwlock concurrency: readers are writers, with the
only exception that they are allowed to 'self-nest'. I.e. things like:

	read_lock(rwlock);
	...
	read_lock(rwlock);

are still legal. (it's also done quite often.)

(it is virtually impossible to implement priority inheritance for true
multi-reader locks in any sane way: i've done it initially and it sucks
very much. It also fundamentally increases the 'lock-dependent'
worst-case latencies - imagine 4 readers having to finish first if a
higher-prio writer comes along. It's insane.)

	Ingo
Re: Real-Time Preemption and RCU
* Paul E. McKenney <[EMAIL PROTECTED]> wrote:

> I have tested this approach, but in user-level scaffolding. All of
> these implementations should therefore be regarded with great
> suspicion: untested, probably don't even compile. Besides which, I
> certainly can't claim to fully understand the real-time preempt patch,
> so I am bound to have gotten something wrong somewhere. [...]

you dont even have to consider the -RT patchset: if the scheme allows
forced preemption of read-side RCU sections on current upstream
CONFIG_PREEMPT, then it's perfect for PREEMPT_RT too.

	Ingo
Re: Real-Time Preemption and RCU
* Paul E. McKenney <[EMAIL PROTECTED]> wrote:

> 4. Preemptible read side.
>
> RCU read-side critical sections can (in theory, anyway) be quite
> large, degrading realtime scheduling response. Preemptible RCU
> read-side critical sections avoid such degradation. Manual
> insertion of "preemption points" might be an alternative as well.
> But I have no intention of trying to settle the long-running
> argument between proponents of preemption and of preemption
> points. Not today, anyway! ;-)

i'm cleverly sidestepping that argument by offering 4 preemption models
in the -RT tree :-) We dont have to pick a winner, users will. The way
low latencies are achieved depends on the preemption model:

	( ) No Forced Preemption (Server)
	( ) Voluntary Kernel Preemption (Desktop)
	( ) Preemptible Kernel (Low-Latency Desktop)
	(X) Complete Preemption (Real-Time)

"Server" is the current default !PREEMPT model. Best throughput, bad
worst-case latencies.

"Low-Latency Desktop" is the current PREEMPT model. Has some runtime
overhead relative to Server, offers fair worst-case latencies.

"Desktop" is a new mode that is somewhere between Server and Low-Latency
Desktop: it's what was initially called 'voluntary preemption'. It
doesnt make the kernel forcibly preemptible anywhere, but inserts a fair
number of preemption points to decrease latencies statistically, while
keeping the runtime overhead close to the "Server" model. (a variant of
this model is utilized by Fedora and RHEL4 currently, so if this gets
upstream i expect distros to pick it up - it can be a migration helper
towards the "Low-Latency Desktop" model.)

"Real-Time" is the no-compromises hard-RT model: almost everything but
the scheduler itself is fully preemptible. Phenomenal worst-case
latencies in every workload scenario, but also has the highest runtime
overhead.

preemptable RCU makes sense for the "Low-Latency Desktop" and
"Real-Time" preemption models - these are the ones that do forced
preemption.

	Ingo
Re: Real-Time Preemption and RCU
* Paul E. McKenney <[EMAIL PROTECTED]> wrote:

> [I believe that the real-time preemption patch moves code
> from softirq/interrupt to process context, but could easily be
> missing something. If need be, there are ways of handling cases
> where realtime RCU is called from softirq and interrupt contexts;
> for an example approach, see the Quick Quiz answers.]

correct, on PREEMPT_RT, IRQ processing is (and must be) in process
context. (it's pretty much a must-have thing to enable the preemption of
irq handlers and softirq processing: if irq/softirq contexts nested on
the kernel stack then irq contexts would 'pin down' the process context
that was interrupted, so even if irq preemption was possible, the
underlying process context couldnt be freely scheduled.)

still, it's nice that your #5 design does not rely on it - it's a sign
of fundamental robustness, plus we could (in theory) offer preemptable
RCU on PREEMPT_DESKTOP (== current PREEMPT) as well - just like there is
PREEMPT_BKL.

	Ingo
Re: Real-Time Preemption and RCU
* Paul E. McKenney [EMAIL PROTECTED] wrote: [I believe that the real-time preemption patch moves code from softirq/interrupt to process context, but could easily be missing something. If need be, there are ways of handling cases were realtime RCU is called from softirq and interrupt contexts, for an example approach, see the Quick Quiz answers.] correct, on PREEMPT_RT, IRQ processing is (and must be) in process context. (it's pretty much a must-have thing to enable the preemption of irq handlers and softirq processing: if irq/softirq contexts nested on the kernel stack then irq contexts would 'pin down' the process context that was interrupted, so even if irq preemption was possible, the underlying process context couldnt be freely scheduled.) still, it's nice that your #5 design does not rely on it - it's a sign of fundamental robustness, plus we could (in theory) offer preemptable RCU on PREEMPT_DESKTOP (== current PREEMPT) as well - just like there is PREEMPT_BKL. Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Real-Time Preemption and RCU
* Paul E. McKenney [EMAIL PROTECTED] wrote: 4. Preemptible read side. RCU read-side critical sections can (in theory, anyway) be quite large, degrading realtime scheduling response. Preemptible RCU read-side critical sections avoid such degradation. Manual insertion of preemption points might be an alternative as well. But I have no intention of trying to settle the long-running argument between proponents of preemption and of preemption points. Not today, anyway! ;-) i'm cleverly sidestepping that argument by offering 4 preemption models in the -RT tree :-) We dont have to pick a winner, users will. The way low latencies are achieved depends on the preemption model: ( ) No Forced Preemption (Server) ( ) Voluntary Kernel Preemption (Desktop) ( ) Preemptible Kernel (Low-Latency Desktop) (X) Complete Preemption (Real-Time) Server is the current default !PREEMPT model. Best throughput, bad worst-case latencies. Low-Latency Desktop is the current PREEMPT model. Has some runtime overhead relative to Server, offers fair worst-case latencies. Desktop is a new mode that is somewhere between Server and Low-Latency Desktop: it's what was initially called 'voluntary preemption'. It doesnt make the kernel forcibly preemptible anywhere, but inserts a fair number of preemption points to decrease latencies statistically, while keeping the runtime overhead close to the Server. (a variant of this model is utilized by Fedora and RHEL4 currently, so if this gets upstream i expect distros to pick it up - it can be a migration helper towards the Low-Latency Desktop model.) Real-Time is the no-compromises hard-RT model: almost everything but the scheduler itself is fully preemptible. Phenomenal worst-case latencies in every workload scenario, but also has the highest runtime overhead. preemptable RCU makes sense for the Low-Latency Desktop and Real-Time preemption models - these are the ones that do forced preemption. 
Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Real-Time Preemption and RCU
* Paul E. McKenney [EMAIL PROTECTED] wrote: I have tested this approach, but in user-level scaffolding. All of these implementations should therefore be regarded with great suspicion: untested, probably don't even compile. Besides which, I certainly can't claim to fully understand the real-time preempt patch, so I am bound to have gotten something wrong somewhere. [...] you dont even have to consider the -RT patchset: if the scheme allows forced preemption of read-side RCU sections on current upstream CONFIG_PREEMPT, then it's perfect for PREEMPT_RT too. Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Real-Time Preemption and RCU
* Ingo Molnar [EMAIL PROTECTED] wrote: * Paul E. McKenney [EMAIL PROTECTED] wrote: I have tested this approach, but in user-level scaffolding. All of these implementations should therefore be regarded with great suspicion: untested, probably don't even compile. Besides which, I certainly can't claim to fully understand the real-time preempt patch, so I am bound to have gotten something wrong somewhere. [...] you dont even have to consider the -RT patchset: if the scheme allows forced preemption of read-side RCU sections on current upstream CONFIG_PREEMPT, then it's perfect for PREEMPT_RT too. there's one detail on PREEMPT_RT though (which i think you noticed too). Priority inheritance handling can be done in a pretty straightforward way as long as no true read-side nesting is allowed for rwsems and rwlocks - i.e. there's only one owner of a lock at a time. So PREEMPT_RT restricts rwsem and rwlock concurrency: readers are writers, with the only exception that they are allowed to 'self-nest'. I.e. things like: read_lock(rwlock); ... read_lock(rwlock); are still legal. (it's also done quite often.) (it is virtually impossible to implement priority inheritance for true multi-reader locks in any sane way: i've done it initially and it sucks very much. It also fundamentally increases the 'lock-dependent' worst-case latencies - imagine 4 readers having to finish first if a higher-prio writer comes along. It's insane.) Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Real-Time Preemption and RCU
* Ingo Molnar [EMAIL PROTECTED] wrote: * Paul E. McKenney [EMAIL PROTECTED] wrote: 4. Preemptible read side. RCU read-side critical sections can (in theory, anyway) be quite large, degrading realtime scheduling response. Preemptible RCU read-side critical sections avoid such degradation. Manual insertion of preemption points might be an alternative as well. But I have no intention of trying to settle the long-running argument between proponents of preemption and of preemption points. Not today, anyway! ;-) i'm cleverly sidestepping that argument by offering 4 preemption models in the -RT tree :-) We dont have to pick a winner, users will. [...] also, it turned out that preemption points vs. forced preemption are not exclusive concepts: PREEMPT_RT relies on _both_ of them. when a highprio task tries to acquire a lock that another, lower-prio task already holds, then 'the time it takes for the lowprio task to drop the lock' directly depends on the frequency of explicit preemption points within the critical section. So to get good 'lock dependent' latencies on PREEMPT_RT we need both a good distribution of preemption points (within locked sections). (obviously, 'lock independent' preemption latencies purely depends on the quality of forced preemption and the size of non-preempt critical sections, not on any explicit preemption point.) dont we love seemingly conflicting concepts that end up helping each other? It's a nice flamewar-killer ;-) Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Real-Time Preemption and RCU
* Ingo Molnar [EMAIL PROTECTED] wrote: there's one detail on PREEMPT_RT though (which i think you noticed too). Priority inheritance handling can be done in a pretty straightforward way as long as no true read-side nesting is allowed for rwsems and rwlocks - i.e. there's only one owner of a lock at a time. So PREEMPT_RT restricts rwsem and rwlock concurrency: readers are writers, with the only exception that they are allowed to 'self-nest'. [...] this does not affect read-side RCU, because read-side RCU can never block a higher-prio thread. (as long as callback processing is pushed into a separate kernel thread.) so RCU will be pretty much the only mechanism (besides lock-free code) that allows reader concurrency on PREEMPT_RT. Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Real-Time Preemption and RCU
there's a problem in #5's rcu_read_lock(): void rcu_read_lock(void) { preempt_disable(); if (current-rcu_read_lock_nesting++ == 0) { current-rcu_read_lock_ptr = __get_cpu_var(rcu_data).lock; read_lock(current-rcu_read_lock_ptr); } preempt_enable(); } not only are read_lock()-ed sections preemptible, read_lock() itself may block, so it cannot be called from within preempt_disable(). How about something like: void rcu_read_lock(void) { preempt_disable(); if (current-rcu_read_lock_nesting++ == 0) { current-rcu_read_lock_ptr = __get_cpu_var(rcu_data).lock; preempt_enable(); read_lock(current-rcu_read_lock_ptr); } else preempt_enable(); } this would still make it 'statistically scalable' - but is it correct? Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Real-Time Preemption and RCU
* Ingo Molnar [EMAIL PROTECTED] wrote: [...] How about something like: void rcu_read_lock(void) { preempt_disable(); if (current-rcu_read_lock_nesting++ == 0) { current-rcu_read_lock_ptr = __get_cpu_var(rcu_data).lock; preempt_enable(); read_lock(current-rcu_read_lock_ptr); } else preempt_enable(); } this would still make it 'statistically scalable' - but is it correct? thinking some more about it, i believe it's correct, because it picks one particular CPU's lock and correctly releases that lock. (read_unlock() is atomic even on PREEMPT_RT, so rcu_read_unlock() is fine.) Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Real-Time Preemption and RCU
* Paul E. McKenney [EMAIL PROTECTED] wrote: 5. Scalability -and- Realtime Response. The trick here is to keep track of the CPU that did the rcu_read_lock() in the task structure. If there is a preemption, there will be some slight inefficiency, but the correct lock will be released, preserving correctness. the inefficiency will be larger, because (as explained in a previous mail) multiple concurrent owners of the read-lock are not allowed. This adds to the overhead of PREEMPT_RT on SMP, but is an intentional tradeoff. I dont expect PREEMPT_RT to be used on an Altix :-| #5 is still the best option, because in the normal 'infrequent preemption' case it will behave in a cache-friendly way. A positive effect of the lock serializing is that the callback backlog will stay relatively small and that the RCU backlog processing can be made deterministic as well (if needed), by making the backlog processing thread(s) SCHED_FIFO. Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Real-Time Preemption and RCU
On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote: 5. Scalability -and- Realtime Response. ... void rcu_read_lock(void) { preempt_disable(); if (current-rcu_read_lock_nesting++ == 0) { current-rcu_read_lock_ptr = __get_cpu_var(rcu_data).lock; read_lock(current-rcu_read_lock_ptr); } preempt_enable(); } Ok, here's a rather unsure question... Uh, is that a sleep violation if that is exclusively held since it can block within an atomic critical section (deadlock) ? bill - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote: On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote: 5. Scalability -and- Realtime Response. ... void rcu_read_lock(void) { preempt_disable(); if (current-rcu_read_lock_nesting++ == 0) { current-rcu_read_lock_ptr = __get_cpu_var(rcu_data).lock; read_lock(current-rcu_read_lock_ptr); } preempt_enable(); } Ok, here's a rather unsure question... Uh, is that a sleep violation if that is exclusively held since it can block within an atomic critical section (deadlock) ? I'd like to note another problem. Mingo's current implementation of rt_mutex (super mutex for all blocking synchronization) is still missing reader counts and something like that would have to be implemented if you want to do priority inheritance over blocks. This is going to throw a wrench into your implementation if you assume that. bill - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 10:53:27AM +0100, Ingo Molnar wrote:
> * Ingo Molnar [EMAIL PROTECTED] wrote:
>
> there's one detail on PREEMPT_RT though (which i think you noticed
> too). Priority inheritance handling can be done in a pretty
> straightforward way as long as no true read-side nesting is allowed
> for rwsems and rwlocks - i.e. there's only one owner of a lock at a
> time. So PREEMPT_RT restricts rwsem and rwlock concurrency: readers
> are 'writers', with the only exception that they are allowed to
> 'self-nest'. [...] this does not affect read-side RCU, because
> read-side RCU can never block a higher-prio thread (as long as
> callback processing is pushed into a separate kernel thread). so RCU
> will be pretty much the only mechanism (besides lock-free code) that
> allows reader concurrency on PREEMPT_RT.

This is a relief! I was wondering how on earth I was going to solve the multi-task priority-inheritance problem!

But... How do we handle the following scenario?

0.	A bunch of low-priority threads are preempted in the middle of
	various RCU read-side critical sections.

1.	A high-priority thread does kmalloc(), but there is no memory,
	so it blocks.

2.	OOM handling notices, and decides to clean up the outstanding
	RCU callbacks. It therefore invokes _synchronize_kernel() (in
	implementation #5).

3.	The _synchronize_kernel() function tries to acquire all of the
	read-side locks, which are held by numerous preempted
	low-priority threads.

4.	What now???

Or does the current patch do priority inheritance across the memory allocator? In other words, can we get away with ignoring this one? ;-)

						Thanx, Paul
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 11:03:39AM +0100, Ingo Molnar wrote:
> there's a problem in #5's rcu_read_lock():
>
>	void rcu_read_lock(void)
>	{
>		preempt_disable();
>		if (current->rcu_read_lock_nesting++ == 0) {
>			current->rcu_read_lock_ptr =
>				&__get_cpu_var(rcu_data).lock;
>			read_lock(current->rcu_read_lock_ptr);
>		}
>		preempt_enable();
>	}
>
> not only are read_lock()-ed sections preemptible, read_lock() itself
> may block, so it cannot be called from within preempt_disable(). How
> about something like:
>
>	void rcu_read_lock(void)
>	{
>		preempt_disable();
>		if (current->rcu_read_lock_nesting++ == 0) {
>			current->rcu_read_lock_ptr =
>				&__get_cpu_var(rcu_data).lock;
>			preempt_enable();
>			read_lock(current->rcu_read_lock_ptr);
>		} else
>			preempt_enable();
>	}
>
> this would still make it 'statistically scalable' - but is it correct?

Good catch! Also good question... Strictly speaking, it is not necessary to block callback invocation until just after rcu_read_lock() returns, so this is correct as long as there is no sort of upcall or callback that can masquerade as this task. I know of no such thing in the Linux kernel -- in fact, such a thing would break a lot of code, right? Any tool that relied on the ->rcu_read_lock_nesting counter to deduce RCU state would be confused by this change, but there might be other ways of handling that, and besides, we are currently making do without such a tool.

It should also be possible to move the preempt_enable() further forward, ahead of the assignment to ->rcu_read_lock_ptr, since that assignment is strictly local. Not sure that this is worthwhile, thoughts?

	void rcu_read_lock(void)
	{
		preempt_disable();
		if (current->rcu_read_lock_nesting++ == 0) {
			preempt_enable();
			current->rcu_read_lock_ptr =
				&__get_cpu_var(rcu_data).lock;
			read_lock(current->rcu_read_lock_ptr);
		} else
			preempt_enable();
	}

The other question is whether the preempt_disable() is needed in the first place. The two task-structure fields are not accessed except by the task itself. I bet that the following is just fine:

	void rcu_read_lock(void)
	{
		if (current->rcu_read_lock_nesting++ == 0) {
			current->rcu_read_lock_ptr =
				&__get_cpu_var(rcu_data).lock;
			read_lock(current->rcu_read_lock_ptr);
		}
	}

Thoughts?

						Thanx, Paul
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > 5. Scalability -and- Realtime Response.
> > ...
> >	void rcu_read_lock(void)
> >	{
> >		preempt_disable();
> >		if (current->rcu_read_lock_nesting++ == 0) {
> >			current->rcu_read_lock_ptr =
> >				&__get_cpu_var(rcu_data).lock;
> >			read_lock(current->rcu_read_lock_ptr);
> >		}
> >		preempt_enable();
> >	}
>
> Ok, here's a rather unsure question... Uh, is that a sleep violation
> if that lock is exclusively held, since it can block within an atomic
> critical section (deadlock)?

Hey, I wasn't joking when I said that I probably got something wrong! ;-)

My current thought is that the preempt_disable()/preempt_enable() pair can be dropped entirely. That messes up any tool that browses through ->rcu_read_lock_nesting, but I don't see any other problem. Yet, anyway!

						Thanx, Paul
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 05:17:29AM -0800, Bill Huey wrote:
> On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> > On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > > 5. Scalability -and- Realtime Response.
> > > ...
> > >	void rcu_read_lock(void)
> > >	{
> > >		preempt_disable();
> > >		if (current->rcu_read_lock_nesting++ == 0) {
> > >			current->rcu_read_lock_ptr =
> > >				&__get_cpu_var(rcu_data).lock;
> > >			read_lock(current->rcu_read_lock_ptr);
> > >		}
> > >		preempt_enable();
> > >	}
> >
> > Ok, here's a rather unsure question... Uh, is that a sleep violation
> > if that lock is exclusively held, since it can block within an atomic
> > critical section (deadlock)?
>
> I'd like to note another problem. Mingo's current implementation of
> rt_mutex (super mutex for all blocking synchronization) is still
> missing reader counts, and something like that would have to be
> implemented if you want to do priority inheritance over blocks. This
> is going to throw a wrench into your implementation if you assume
> that.

If we need to do priority inheritance across the memory allocator, so that high-priority tasks blocked waiting for memory pass their priority on to RCU readers, agreed. But I don't see any sign that the real-time preempt patch does this. In the absence of this, as Ingo noted, the fact that readers don't block each other should make things safe. I think...

						Thanx, Paul
Re: Real-Time Preemption and RCU
* Paul E. McKenney [EMAIL PROTECTED] wrote:

>	preempt_disable();
>	if (current->rcu_read_lock_nesting++ == 0) {
>		current->rcu_read_lock_ptr =
>			&__get_cpu_var(rcu_data).lock;
>		read_lock(current->rcu_read_lock_ptr);
>	}
>	preempt_enable();
>
> My current thought is that the preempt_disable()/preempt_enable() can
> be dropped entirely. Messes up any tool that browses through
> ->rcu_read_lock_nesting, but don't see any other problem. Yet, anyway!

yeah - this sounds good. (We are not aiming for irq-safe RCU anyway, on PREEMPT_RT.)

	Ingo
Re: Real-Time Preemption and RCU
* Bill Huey [EMAIL PROTECTED] wrote:

> I'd like to note another problem. Mingo's current implementation of
> rt_mutex (super mutex for all blocking synchronization) is still
> missing reader counts, and something like that would have to be
> implemented if you want to do priority inheritance over blocks.

i really have no intention to allow multiple readers for rt-mutexes. We got away with that so far, and i'd like to keep it so. Imagine 100 threads all blocked in the same critical section (holding the read-lock) when a high-prio writer thread comes around: that is an instant 100x latency to let all of them roll forward. The only sane solution is not to allow excessive concurrency. (That limits SMP scalability, but there's no other choice i can see.)

	Ingo
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 08:49:03AM +0100, Ingo Molnar wrote:
> * Paul E. McKenney [EMAIL PROTECTED] wrote:
>
> > Seems to me that it would be good to have an RCU implementation
> > that meets the requirements of the Real-Time Preemption patch, but
> > that is 100% compatible with the classic RCU API. Such an
> > implementation must meet a number of requirements, which are listed
> > at the end of this message (search for REQUIREMENTS).
>
> [ Wow. you must be a secret telepathic mind-reader - yesterday i was
>   thinking about mailing you, because my approach to RCU
>   preemptability (the API variants) clearly sucked and caused
>   problems all around, both in terms of maintainability and in terms
>   of stability and scalability. ]

Wish I could claim that, but must credit dumb luck. ;-)

> > 5. The final version, which both scales and meets realtime
> >    requirements, as well as exactly matching the classic RCU API.
> >    I have tested this approach, but in user-level scaffolding. All
> >    of these implementations should therefore be regarded with great
> >    suspicion: untested, probably don't even compile. Besides which,
> >    I certainly can't claim to fully understand the real-time
> >    preempt patch, so I am bound to have gotten something wrong
> >    somewhere. In any case, none of these implementations is a
> >    suitable replacement for classic RCU on large servers, since
> >    they acquire locks in the RCU read-side critical sections.
> >    However, they should scale enough to support small SMP systems,
> >    inflicting only a modest performance penalty.
>
> basically for PREEMPT_RT the only constraint is that RCU sections
> should be preemptable. Whatever the performance cost. If PREEMPT_RT
> is merged into the upstream kernel then it will (at least initially)
> be at a status similar to NOMMU: it will be tolerated as long as it
> causes no 'drag' on the main code. The RCU API variants i introduced
> clearly violated this requirement, and were my #1 worry wrt. upstream
> mergeability.

Fair enough! I believe that implementation #5 is most appropriate for real-time preempt kernels.

[...]

> yeah, agreed - it looks perfect - both the read and write side are
> preemptable. Can i just plug the code you sent into rcupdate.c and
> expect it to work, or would you like to send a patch? If you prefer,
> you can make it an unconditional patch against an upstream kernel to
> keep things simple for you - i'll then massage it to be properly
> PREEMPT_RT dependent.

I will give it a shot, but feel free to grab anything at any point, starting either from the patch I will send in a bit or from my earlier email. Do whatever works best for you.

For the patch, here are my questions:

o	What is the best way to select between classic RCU and this
	scheme?

	1. Massive #ifdef across rcupdate.c

	2. Create an rcupdate_rt.c and browbeat the build system into
	   picking one or the other (no clue if this is possible...)

	3. Create an rcupdate_rt.c and rely on the linker to pick one
	   or the other, with rcupdate.h generating different external
	   symbol names to make the choice.

	Left to myself, I would grit my teeth and do #1, since I
	understand how to do it. Flames welcome, as long as accompanied
	by good advice!

o	How best to interface to OOM? Left to myself, I leave this for
	later. ;-)

I will take the cowardly approach of patching against the upstream kernel.

[...]

> > In theory, #3 might be appropriate, but if I understand the
> > real-time preempt implementation of reader-writer locks, it will
> > not perform well if there are long RCU read-side critical sections,
> > even in UP kernels.
>
> all RCU-locked sections must be preemptable in -RT. Basically RCU is
> a mainstream API that is used by lots of code and will be introduced
> in many other areas as well. From the -RT kernel's POV this is an
> 'uncontrollable latency source', which keeps introducing critical
> sections. One major goal of PREEMPT_RT is to convert all popular
> critical-section APIs into preemptible sections, so that the amount
> of code that is non-preemptable is drastically reduced and can be
> managed (and thus can be trusted). This goal has a higher priority
> than any performance consideration, because it doesn't matter what
> performance you have if you cannot trust the kernel to be
> deterministic.

Makes sense to me!

						Thanx, Paul