Re: Real-Time Preemption and RCU

2005-03-23 Thread Paul E. McKenney
On Wed, Mar 23, 2005 at 12:44:52PM +0100, Esben Nielsen wrote:
> On Tue, 22 Mar 2005, Paul E. McKenney wrote:
> 
> > On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
> > > On Mon, 21 Mar 2005, Paul E. McKenney wrote:
> > [ . . . ]
> > > > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > > > This is in some ways similar to the K42 approach to RCU (which they call
> > > > "generations").  Dipankar put together a similar patch for Linux, but
> > > > the problem was that grace periods could be deferred for an extremely
> > > > long time.  Which I suspect is what you were calling out as causing
> > > > RCU batches never to run.
> > > 
> > > That is where the preempt_by_nonrt_disable/enable() is supposed to help:
> > > Then it can't take longer than the normal kernel in the situation where
> > > there are no RT tasks running. RT tasks will prolong the grace periods if
> > > they go into RCU regions, but they are supposed to be relatively small -
> > > and deterministic!
> > 
> > The part that I am missing is how this helps in the case where a non-RT
> > task gets preempted in the middle of an RCU read-side critical section
> > indefinitely.  Or are you boosting the priority of any task that
> > enters an RCU read-side critical section?
> 
> Yes, in effect: I set the priority to MAX_RT_PRIO. But actually I am
> playing around (when I get time for it, that is :-( ) with a cheaper
> solution:
> I assume these regions where you don't want to be preempted by non-RT
> tasks are relatively short. Therefore the risk of getting preempted is
> small. Moving the priority is expensive since you need to lock the
> runqueue. I only want to do the movement when there is a preemption.
> Therefore I added code in schedule() to take care of it: if a task is
> in an rcu-read section, is non-RT and is preempted, its priority is set
> to MAX_RT_PRIO for the time being. It will keep that priority until the
> priority is recalculated, but that shouldn't hurt anyone.
> I am not happy about adding code to schedule(), but setting the
> priority in there is very cheap because it already has the lock on the
> runqueue. Furthermore, I assume it only happens very rarely. In the
> execution of schedule() my code only adds a single test on whether the
> previous task was in an rcu-section or not. That is not very much code.

Interesting approach -- could come in handy.

> I have not yet tested it (no time :-( )

Well, being as I haven't got the lock-based scheme fully running yet,
I can't give you too much trouble about that.  :-/

Thanx, Paul

> > [...]
> > > > Yes, but this is true of every other lock in the system as well, not?
> > > 
> > > Other locks are not globally used but only used for a specific subsystem.
> > > On a real-time system you are supposed to know which subsystems you can
> > > call into and still have a low enough latency, as each subsystem has its
> > > own bound. But with a global RCU locking mechanism, all RCU-using code is
> > > to be regarded as _one_ such subsystem.
> > 
> > Yep.  As would the things protected by the dcache lock, task list lock,
> > and so on, right?
> 
> Yep
> 
> > 
> > Thanx, Paul
> > 
> Esben
> 
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Real-Time Preemption and RCU

2005-03-23 Thread Esben Nielsen
On Tue, 22 Mar 2005, Paul E. McKenney wrote:

> On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
> > On Mon, 21 Mar 2005, Paul E. McKenney wrote:
> [ . . . ]
> > > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > > This is in some ways similar to the K42 approach to RCU (which they call
> > > "generations").  Dipankar put together a similar patch for Linux, but
> > > the problem was that grace periods could be deferred for an extremely
> > > long time.  Which I suspect is what you were calling out as causing
> > > RCU batches never to run.
> > 
> > That is where the preempt_by_nonrt_disable/enable() is supposed to help:
> > Then it can't take longer than the normal kernel in the situation where
> there are no RT tasks running. RT tasks will prolong the grace periods if
> > they go into RCU regions, but they are supposed to be relatively small -
> > and deterministic!
> 
> The part that I am missing is how this helps in the case where a non-RT
> task gets preempted in the middle of an RCU read-side critical section
> indefinitely.  Or are you boosting the priority of any task that
> enters an RCU read-side critical section?

Yes, in effect: I set the priority to MAX_RT_PRIO. But actually I am
playing around (when I get time for it, that is :-( ) with a cheaper
solution:
I assume these regions where you don't want to be preempted by non-RT
tasks are relatively short. Therefore the risk of getting preempted is
small. Moving the priority is expensive since you need to lock the
runqueue. I only want to do the movement when there is a preemption.
Therefore I added code in schedule() to take care of it: if a task is
in an rcu-read section, is non-RT and is preempted, its priority is set
to MAX_RT_PRIO for the time being. It will keep that priority until the
priority is recalculated, but that shouldn't hurt anyone.
I am not happy about adding code to schedule(), but setting the
priority in there is very cheap because it already has the lock on the
runqueue. Furthermore, I assume it only happens very rarely. In the
execution of schedule() my code only adds a single test on whether the
previous task was in an rcu-section or not. That is not very much code.
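The idea described above can be sketched as a toy user-space model (struct and function names are illustrative, not the actual kernel code; as in the kernel, a lower numeric priority value means a higher priority):

```c
#include <assert.h>

#define MAX_RT_PRIO 100  /* toy stand-in: prios at or below this count as RT */

struct toy_task {
	int prio;            /* kernel convention: lower value = higher priority */
	int rcu_read_depth;  /* nesting count of rcu_read_lock() */
};

/*
 * Hypothetical hook modelling the single test added to schedule():
 * boost only when the task being preempted is inside an RCU read-side
 * section, so the common path pays just one comparison.
 */
static void boost_preempted_rcu_reader(struct toy_task *prev)
{
	if (prev->rcu_read_depth > 0 && prev->prio > MAX_RT_PRIO)
		prev->prio = MAX_RT_PRIO;  /* "for the time being" */
}
```

The boost then simply persists until the next normal priority recalculation, as the message above describes.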

I have not yet tested it (no time :-( )


> [...]
> > > Yes, but this is true of every other lock in the system as well, not?
> > 
> > Other locks are not globally used but only used for a specific subsystem.
> > On a real-time system you are supposed to know which subsystems you can
> > call into and still have a low enough latency, as each subsystem has its
> > own bound. But with a global RCU locking mechanism, all RCU-using code is
> > to be regarded as _one_ such subsystem.
> 
> Yep.  As would the things protected by the dcache lock, task list lock,
> and so on, right?

Yep

> 
>   Thanx, Paul
> 
Esben



Re: Real-Time Preemption and RCU

2005-03-22 Thread Paul E. McKenney
On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
> On Mon, 21 Mar 2005, Paul E. McKenney wrote:
[ . . . ]
> > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > This is in some ways similar to the K42 approach to RCU (which they call
> > "generations").  Dipankar put together a similar patch for Linux, but
> > the problem was that grace periods could be deferred for an extremely
> > long time.  Which I suspect is what you were calling out as causing
> > RCU batches never to run.
> 
> That is where the preempt_by_nonrt_disable/enable() is supposed to help:
> Then it can't take longer than the normal kernel in the situation where
> there are no RT tasks running. RT tasks will prolong the grace periods if
> they go into RCU regions, but they are supposed to be relatively small -
> and deterministic!

The part that I am missing is how this helps in the case where a non-RT
task gets preempted in the middle of an RCU read-side critical section
indefinitely.  Or are you boosting the priority of any task that
enters an RCU read-side critical section?

> > > > The counter approach might work, and is also what the implementation #5
> > > > does -- check out rcu_read_lock() in Ingo's most recent patch.
> > > > 
> > > 
> > > Do you refer to your original mail with implementing it in 5 steps?
> > > In #5 in that one (-V0.7.41-00, right?) you use a lock, and as you say that
> > > forces synchronization between the CPUs - bad for scaling. It does make the
> > > RCU batches somewhat deterministic, as the RCU task can boost the readers
> > > to the rcu-task's priority.
> > > The problem with this approach is that everybody calling into RCU code
> > > has a worst-case behaviour of the system-wide worst-case RCU reader
> > > section - which can be pretty large (in principle infinite if somebody.)
> > > So if somebody uses a call to a function in the code containing an RCU read
> > > area, the worst-case behaviour would be the same as the worst-case latency
> > > in the simple world where preempt_disable()/preempt_enable() was used.
> > 
> > I missed something here -- readers would see the worst-case -writer-
> > latency rather than the worst-case -reader- latency, right?  Or are you
> > concerned about the case where some reader blocks the write-side
> > acquisitions in _synchronize_kernel(), but not before the writer has
> > grabbed a lock, blocking any readers on the corresponding CPU?
> 
> I am concerned that readers block each other, too. You do need an rw-mutex
> allowing an unlimited number of readers for doing this. With the current
> rw-mutex the readers block each other, i.e. the worst-case latency is the
> worst-case reader latency - globally!
> On the other hand, with an rw-lock being unlimited - and thus not keeping
> track of its readers - the readers can't be boosted by the writer. Then you
> are back to square 1: the grace period can take a very long time.

OK, good point.

> > Yes, but this is true of every other lock in the system as well, not?
> 
> Other locks are not globally used but only used for a specific subsystem.
> On a real-time system you are supposed to know which subsystems you can
> call into and still have a low enough latency, as each subsystem has its
> own bound. But with a global RCU locking mechanism, all RCU-using code is
> to be regarded as _one_ such subsystem.

Yep.  As would the things protected by the dcache lock, task list lock,
and so on, right?

Thanx, Paul

> > In a separate conversation a few weeks ago, Steve Hemminger suggested
> > placing checks in long RCU read-side critical sections -- this could
> > be used to keep the worst-case reader latency within the desired bounds.
> > Not pretty, but, then again, bounding lock latency will require a fair
> > number of similar changes, I suspect.
> > 
> > Thanx, Paul
> > 
> 
> Esben
> 
> 


Re: Real-Time Preemption and RCU

2005-03-22 Thread Esben Nielsen
On Tue, 22 Mar 2005, Ingo Molnar wrote:

> 
> * Esben Nielsen <[EMAIL PROTECTED]> wrote:
> 
> > > > +static inline void rcu_read_lock(void)
> > > > +{  
> > > > +   preempt_disable(); 
> > > > +   __get_cpu_var(rcu_data).active_readers++;
> > > > +   preempt_enable();
> > > > +}
> > > 
> > > this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
> > > the same CPU, and hence ->active_readers can get out of sync.
> > > 
> > 
> > Ok, this has to be handled in the migration code somehow. I have already
> > added a
> >   current->rcu_read_depth++
> > so it ought to be painless. A simple solution would be not to
> > migrate threads with rcu_read_depth != 0.
> 
> could you elaborate?
> 
Put an rcu_read_depth on each task. In rcu_read_lock() do
	current->rcu_read_depth++;
and vice versa in rcu_read_unlock(). In can_migrate_task() add
	if (p->rcu_read_depth)
		return 0;
That might do the trick.
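As a toy user-space model of that suggestion (names borrowed from the discussion; the real can_migrate_task() lives in the scheduler and takes more context than this):

```c
#include <assert.h>

struct toy_task { int rcu_read_depth; };

/*
 * Illustrative check: a task inside an RCU read-side section is simply
 * not migrated, so the per-CPU active_readers counter it incremented
 * can never be decremented on a different CPU.
 */
static int toy_can_migrate_task(const struct toy_task *p)
{
	if (p->rcu_read_depth)
		return 0;  /* pinned while in an rcu-read section */
	return 1;
}
```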
 

> In any case, see the new PREEMPT_RCU code in the -40-07 patch (and
> upwards). I've also attached a separate patch, it should apply cleanly
> to 2.6.12-rc1.
> 

I barely have time to download the patches - let alone apply them!

Anyway, I find one thing I don't like: using atomic_inc()/dec() in
rcu_read_lock()/unlock() to touch rcu_data which might be on another
CPU. Then rcu_data is not really per-CPU data anymore, and it also hurts
the performance of RCU readers.
I think it will be cheaper to use the above rcu_read_depth and then either
not migrate tasks at all, or make the migration code take care of migrating
the rcu_read_depth count to the new CPU - one would have to take care to
increment it in the rcu_data of the new CPU on the new CPU (it isn't
atomic) and then decrement it in the rcu_data of the old CPU on the old
CPU - in that order.
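The ordering constraint can be modelled with toy per-CPU counters (plain ints standing in for the per-CPU rcu_data; in the kernel each step would run on its respective CPU):

```c
#include <assert.h>

enum { NCPU = 2 };
static int active_readers[NCPU];  /* toy stand-ins for per-CPU rcu_data counters */

/*
 * Move one reader's contribution from old_cpu to new_cpu in the safe
 * order described above: increment on the new CPU first, then decrement
 * on the old one, so the total reader count never dips to zero
 * mid-migration (which could let a grace period end while the task is
 * still inside its read-side critical section).
 */
static void migrate_reader_count(int old_cpu, int new_cpu)
{
	active_readers[new_cpu]++;  /* step 1, executed on the new CPU */
	active_readers[old_cpu]--;  /* step 2, executed on the old CPU */
}
```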


>   Ingo
> 


Esben



Re: Real-Time Preemption and RCU

2005-03-22 Thread Ingo Molnar

* Esben Nielsen <[EMAIL PROTECTED]> wrote:

> > > +static inline void rcu_read_lock(void)
> > > +{
> > > + preempt_disable(); 
> > > + __get_cpu_var(rcu_data).active_readers++;
> > > + preempt_enable();
> > > +}
> > 
> > this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
> > the same CPU, and hence ->active_readers can get out of sync.
> > 
> 
> Ok, this has to be handled in the migration code somehow. I have already
> added a
>   current->rcu_read_depth++
> so it ought to be painless. A simple solution would be not to
> migrate threads with rcu_read_depth != 0.

could you elaborate?

In any case, see the new PREEMPT_RCU code in the -40-07 patch (and
upwards). I've also attached a separate patch, it should apply cleanly
to 2.6.12-rc1.

Ingo
--- linux/kernel/rcupdate.c.orig
+++ linux/kernel/rcupdate.c
@@ -468,3 +468,36 @@ module_param(maxbatch, int, 0);
 EXPORT_SYMBOL(call_rcu);
 EXPORT_SYMBOL(call_rcu_bh);
 EXPORT_SYMBOL(synchronize_kernel);
+
+#ifdef CONFIG_PREEMPT_RCU
+
+void rcu_read_lock(void)
+{
+	if (current->rcu_read_lock_nesting++ == 0) {
+		current->rcu_data = &get_cpu_var(rcu_data);
+		atomic_inc(&current->rcu_data->active_readers);
+		put_cpu_var(rcu_data);
+	}
+}
+EXPORT_SYMBOL(rcu_read_lock);
+
+void rcu_read_unlock(void)
+{
+   int cpu;
+
+	if (--current->rcu_read_lock_nesting == 0) {
+		atomic_dec(&current->rcu_data->active_readers);
+		/*
+		 * Check whether we have reached quiescent state.
+		 * Note! This is only for the local CPU, not for
+		 * current->rcu_data's CPU [which typically is the
+		 * current CPU, but may also be another CPU].
+		 */
+		cpu = get_cpu();
+		rcu_qsctr_inc(cpu);
+		put_cpu();
+	}
+}
+EXPORT_SYMBOL(rcu_read_unlock);
+
+#endif
--- linux/include/linux/rcupdate.h.orig
+++ linux/include/linux/rcupdate.h
@@ -99,6 +99,9 @@ struct rcu_data {
struct rcu_head *donelist;
struct rcu_head **donetail;
int cpu;
+#ifdef CONFIG_PREEMPT_RCU
+	atomic_t active_readers;
+#endif
 };
 
 DECLARE_PER_CPU(struct rcu_data, rcu_data);
@@ -115,12 +118,18 @@ extern struct rcu_ctrlblk rcu_bh_ctrlblk
 static inline void rcu_qsctr_inc(int cpu)
 {
	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
+#ifdef CONFIG_PREEMPT_RCU
+	if (!atomic_read(&rdp->active_readers))
+#endif
+		rdp->passed_quiesc = 1;
 }
 static inline void rcu_bh_qsctr_inc(int cpu)
 {
	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
+#ifdef CONFIG_PREEMPT_RCU
+	if (!atomic_read(&rdp->active_readers))
+#endif
+		rdp->passed_quiesc = 1;
 }
 
 static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
@@ -183,14 +192,22 @@ static inline int rcu_pending(int cpu)
  *
  * It is illegal to block while in an RCU read-side critical section.
  */
-#define rcu_read_lock()		preempt_disable()
+#ifdef CONFIG_PREEMPT_RCU
+  extern void rcu_read_lock(void);
+#else
# define rcu_read_lock()	preempt_disable()
+#endif
 
 /**
  * rcu_read_unlock - marks the end of an RCU read-side critical section.
  *
  * See rcu_read_lock() for more information.
  */
-#define rcu_read_unlock()  preempt_enable()
+#ifdef CONFIG_PREEMPT_RCU
+  extern void rcu_read_unlock(void);
+#else
# define rcu_read_unlock()	preempt_enable()
+#endif
 
 /*
  * So where is rcu_write_lock()?  It does not exist, as there is no
@@ -213,14 +230,16 @@ static inline int rcu_pending(int cpu)
  * can use just rcu_read_lock().
  *
  */
-#define rcu_read_lock_bh() local_bh_disable()
+//#define rcu_read_lock_bh()   local_bh_disable()
+#define rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); }
 
 /*
  * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
  *
  * See rcu_read_lock_bh() for more information.
  */
-#define rcu_read_unlock_bh()   local_bh_enable()
+//#define rcu_read_unlock_bh() local_bh_enable()
+#define rcu_read_unlock_bh()   { local_bh_enable(); rcu_read_unlock(); }
 
 /**
  * rcu_dereference - fetch an RCU-protected pointer in an
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -727,6 +872,10 @@ struct task_struct {
nodemask_t mems_allowed;
int cpuset_mems_generation;
 #endif
+#ifdef CONFIG_PREEMPT_RCU
+   int rcu_read_lock_nesting;
+   struct rcu_data *rcu_data;
+#endif
 };
 
 static inline pid_t process_group(struct task_struct *tsk)


Re: Real-Time Preemption and RCU

2005-03-22 Thread Esben Nielsen
On Tue, 22 Mar 2005, Ingo Molnar wrote:

> 
> * Esben Nielsen <[EMAIL PROTECTED]> wrote:
> 
> > +static inline void rcu_read_lock(void)
> > +{  
> > +   preempt_disable(); 
> > +   __get_cpu_var(rcu_data).active_readers++;
> > +   preempt_enable();
> > +}
> 
> this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
> the same CPU, and hence ->active_readers can get out of sync.
> 

Ok, this has to be handled in the migration code somehow. I have already
added a
	current->rcu_read_depth++
so it ought to be painless. A simple solution would be not to migrate
threads with rcu_read_depth != 0.

>   Ingo
> 

Esben



Re: Real-Time Preemption and RCU

2005-03-22 Thread Ingo Molnar

* Esben Nielsen <[EMAIL PROTECTED]> wrote:

> +static inline void rcu_read_lock(void)
> +{
> + preempt_disable(); 
> + __get_cpu_var(rcu_data).active_readers++;
> + preempt_enable();
> +}

this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
the same CPU, and hence ->active_readers can get out of sync.

Ingo


Re: Real-Time Preemption and RCU

2005-03-22 Thread Esben Nielsen
On Tue, 22 Mar 2005, Bill Huey wrote:

> On Fri, Mar 18, 2005 at 05:55:44PM +0100, Esben Nielsen wrote:
> > On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > > i really have no intention to allow multiple readers for rt-mutexes. We
> > > got away with that so far, and i'd like to keep it so. Imagine 100
> > > threads all blocked in the same critical section (holding the read-lock)
> > > when a highprio writer thread comes around: instant 100x latency to let
> > > all of them roll forward. The only sane solution is to not allow
> > > excessive concurrency. (That limits SMP scalability, but there's no
> > > other choice i can see.)
> > 
> > Unless a design change is made: One could argue for a semantics where
> > write-locking _isn't_ deterministic and thus do not have to boost all the
> 
> RCU isn't write deterministic like typical RT apps are we can... (below :-))

It is: the write takes place right away. What is non-deterministic is
when _all_ readers have actually seen it. Also the cleanup is
non-deterministic. So unless you actually _wait_ for the cleanup to
happen instead of deferring it, you can safely do RCU writes in an RT task.
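A minimal sketch of that point, assuming a single updater (toy variables, not kernel API): the pointer swap is immediate and deterministic; only reclamation is deferred, standing in for what call_rcu() would queue.

```c
#include <assert.h>
#include <stddef.h>

static int a = 1, b = 2;
static int *shared = &a;  /* pointer that RCU readers dereference */
static int *deferred;     /* toy stand-in for a queued call_rcu() callback */

/*
 * The update itself is one pointer assignment, visible right away.
 * The old version is not freed here; it is merely queued, so the
 * updater never blocks waiting for readers.
 */
static void toy_rcu_update(int *newp)
{
	int *old = shared;
	shared = newp;    /* publish immediately */
	deferred = old;   /* defer cleanup instead of waiting for readers */
}
```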

> 
> > readers. Readers boost the writers but not the other way around. Readers
> > will be deterministic, but not writers.
> > Such a semantics would probably work for a lot of RT applications
> > happening not to take any write-locks - these will in fact perform better. 
> > But it will give the rest a lot of problems.
> 
> Just came up with an idea after I thought about how much of a bitch it
> would be to get a fast RCU multipule reader semantic (our current shared-
> exclusive lock inserts owners into a sorted priority list per-thread which
> makes it very expensive for a simple RCU case since they are typically very
> small batches of items being altered). Basically the RCU algorithm has *no*
> notion of writer priority and to propagate a PI operation down all reader
> is meaningless, so why not revert back to the original rwlock-semaphore to
> get the multipule reader semantics ?

Remember to boost the writer so that RT tasks can enter read regions. I must
also warn against the dangers: a lot of code where a write-lock is taken
needs to be marked as non-deterministic, i.e. must not be called from
RT tasks (maybe put a WARN_ON(rt_task(current)) in the write-lock
operation?)

> 
> A notion of priority across a quiescence operation is crazy anyways, so
> it would be safe just to use the old rwlock-semaphore "in place" without
> any changes or priority handling additions. The RCU algorithm is only concerned
> with what is basically a coarse data guard and it isn't time or priority
> critical.

I don't find it crazy. I think it is elegant - but also dangerous, as it
might take a long time.

> 
> What do you folks think? That would make Paul's stuff respect multiple
> readers, which reduces contention and gets around the problem of possibly
> overloading the current rt lock implementation that we've been bitching
> about. The current RCU development track seems wrong in the first place, and
> this seems like it could be a better and more complete solution to the problem.
> 
> If this works, well, you heard it here first. :)
> 
> bill
> 
Esben



Re: Real-Time Preemption and RCU

2005-03-22 Thread hui
On Tue, Mar 22, 2005 at 02:17:27AM -0800, Bill Huey wrote:
> > A notion of priority across a quiescience operation is crazy anyways[-,-] so
> > it would be safe just to use to the old rwlock-semaphore "in place" without
> > any changes or priorty handling add[i]tions. The RCU algorithm is only 
> > concerned
> > with what is basically a coarse data guard and it isn't time or priority
> > critical.
> 
> A little jitter in a quiescence operation isn't going to hurt things, right?

The only thing that I can think of that can go wrong here is what kind
of effect it would have on a thread write-blocking against a bunch of
RCU readers. It could introduce a chain of delays into, say, a timer event
and might cause problems/side-effects for other things being processed.
RCU processing might have to be decoupled and processed by a different
thread to avoid some of that latency weirdness.

What do you folks think ?

bill



Re: Real-Time Preemption and RCU

2005-03-22 Thread Esben Nielsen
On Tue, 22 Mar 2005, Ingo Molnar wrote:

> 
> * Esben Nielsen <[EMAIL PROTECTED]> wrote:
> 
> > On the other hand with a rw-lock being unlimited - and thus do not
> > keep track of it readers - the readers can't be boosted by the writer.
> > Then you are back to square 1: The grace period can take a very long
> > time.
> 
> btw., is the 'very long grace period' a real issue? We could avoid all
> the RCU read-side locking latencies by making it truly unlocked and just
> living with the long grace periods. Perhaps it's enough to add an
> emergency mechanism to the OOM handler, which frees up all the 'blocked
> by preemption' RCU callbacks via some scheduler magic. (e.g. such an
> emergency mechanism could be _conditional_ locking on the read side -
> i.e. new RCU read-side users would be blocked until the OOM situation
> goes away, or something like that.)

You won't catch RCU read sides already entered - see below.

> 
> your patch is implementing just that, correct? Would you mind redoing it
> against a recent -RT base? (-40-04 or so)
>

In fact I am working on clean 2.6.12-rc1 right now. I figured this is
orthogonal to the rest of the RT patch and can probably go upstream
immediately. It seemed to work. I'll try to make it into a clean patch
soonish and also try it on -40-04.
I am trying to make a boosting mechanism in the scheduler such that RCU
readers are boosted to MAX_RT_PRIO when preempted. I have to take it out
first.

Any specific tests I have to run? I am considering making an RCU test
device.

> also, what would be the worst-case workload causing long grace periods?

A nice-19 task, A, enters an RCU region and is preempted. A lot of other
tasks start running. Then task A might be starved for _minutes_, so that
there are no RCU grace periods in all that time.

> 
>   Ingo

Esben



Re: Real-Time Preemption and RCU

2005-03-22 Thread hui
On Tue, Mar 22, 2005 at 02:04:46AM -0800, Bill Huey wrote:
> RCU isn't write deterministic like typical RT apps are[, so] we can... (below 
> :-))
... 
> Just came up with an idea after I thought about how much of a bitch it
> would be to get a fast RCU multipule reader semantic (our current shared-
> exclusive lock inserts owners into a sorted priority list per-thread which
> makes it very expensive for a simple RCU case[,] since they are typically very
> small batches of items being altered). Basically the RCU algorithm has *no*
> notion of writer priority and to propagate a PI operation down all reader[s]
> is meaningless, so why not revert back to the original rwlock-semaphore to
> get the multipule reader semantics ?

The original lock, for those that don't know, doesn't strictly track read
owners, so reentrancy is cheap.

> A notion of priority across a quiescience operation is crazy anyways[-,-] so
> it would be safe just to use to the old rwlock-semaphore "in place" without
> any changes or priorty handling add[i]tions. The RCU algorithm is only 
> concerned
> with what is basically a coarse data guard and it isn't time or priority
> critical.

A little jitter in a quiescence operation isn't going to hurt things, right?

> What do you folks think ? That would make Paul's stuff respect multipule
> readers which reduces contention and gets around the problem of possibly
> overloading the current rt lock implementation that we've been bitching
> about. The current RCU development track seem wrong in the first place and
> this seem like it could be a better and more complete solution to the problem.

Got to get rid of those typos :)

bill



Re: Real-Time Preemption and RCU

2005-03-22 Thread hui
On Fri, Mar 18, 2005 at 05:55:44PM +0100, Esben Nielsen wrote:
> On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > i really have no intention to allow multiple readers for rt-mutexes. We
> > got away with that so far, and i'd like to keep it so. Imagine 100
> > threads all blocked in the same critical section (holding the read-lock)
> > when a highprio writer thread comes around: instant 100x latency to let
> > all of them roll forward. The only sane solution is to not allow
> > excessive concurrency. (That limits SMP scalability, but there's no
> > other choice i can see.)
> 
> Unless a design change is made: One could argue for a semantics where
> write-locking _isn't_ deterministic and thus do not have to boost all the

RCU isn't write deterministic like typical RT apps are, so we can... (below :-))

> readers. Readers boost the writers but not the other way around. Readers
> will be deterministic, but not writers.
> Such a semantics would probably work for a lot of RT applications
> happening not to take any write-locks - these will in fact perform better. 
> But it will give the rest a lot of problems.

Just came up with an idea after I thought about how much of a bitch it
would be to get a fast RCU multiple-reader semantic (our current shared-
exclusive lock inserts owners into a sorted priority list per-thread, which
makes it very expensive for a simple RCU case since they are typically very
small batches of items being altered). Basically the RCU algorithm has *no*
notion of writer priority, and to propagate a PI operation down all readers
is meaningless, so why not revert back to the original rwlock-semaphore to
get the multiple-reader semantics?

A notion of priority across a quiescence operation is crazy anyway, so
it would be safe just to use the old rwlock-semaphore "in place" without
any changes or priority handling additions. The RCU algorithm is only concerned
with what is basically a coarse data guard, and it isn't time or priority
critical.

What do you folks think? That would make Paul's stuff respect multiple
readers, which reduces contention and gets around the problem of possibly
overloading the current rt lock implementation that we've been bitching
about. The current RCU development track seems wrong in the first place, and
this seems like it could be a better and more complete solution to the problem.

If this works, well, you heard it here first. :)

bill



Re: Real-Time Preemption and RCU

2005-03-22 Thread Ingo Molnar

* Esben Nielsen <[EMAIL PROTECTED]> wrote:

> On the other hand with a rw-lock being unlimited - and thus do not
> keep track of it readers - the readers can't be boosted by the writer.
> Then you are back to square 1: The grace period can take a very long
> time.

btw., is the 'very long grace period' a real issue? We could avoid all
the RCU read-side locking latencies by making it truly unlocked and just
living with the long grace periods. Perhaps it's enough to add an
emergency mechanism to the OOM handler, which frees up all the 'blocked
by preemption' RCU callbacks via some scheduler magic. (e.g. such an
emergency mechanism could be _conditional_ locking on the read side -
i.e. new RCU read-side users would be blocked until the OOM situation
goes away, or something like that.)

your patch is implementing just that, correct? Would you mind redoing it
against a recent -RT base? (-40-04 or so)

also, what would be the worst-case workload causing long grace periods?

Ingo


Re: Real-Time Preemption and RCU

2005-03-22 Thread Esben Nielsen

On Mon, 21 Mar 2005, Paul E. McKenney wrote:

> On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > > [...] 
> > Well, I was actually thinking of an API like
> >  preempt_by_nonrt_disable()
> >  preempt_by_nonrt_enable()
> > working like the old preempt_disable()/preempt_enable() but still
> > allowing RT tasks (as well as priority inheriting non-RT tasks) to be
> > scheduled. That would kind of help "split" the kernel into two halfs: the
> > RT part and the non-RT part. The non-RT part would in many ways work as it
> > has always done.
> 
> Does sound in some ways similar to the migration approach -- there, the
> RT/non-RT split is made across CPUs.  But if RT is allowed to preempt,
> then you still have to deal with preemption for locking correctness, right?
> 
Yes. It is not a locking mechanism; it should just prevent scheduling of
"normal" userspace tasks.

> > [..] 
> > Well, in my patch I do not leave preemption off - only while doing the
> > simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader
> > might be active, i.e. preempted, on the current CPU such that this isn't
> > and quiescent point after all.
> > (To others: Paul nicely unfolded my attachment below - I left it in
> > the mail such you can read it.)
> > The problem with this approach is ofcourse that user space programs might
> > preempt an RCU reader for a very long time such that RCU batches are never
> > really run. The boosting of non-RT tasks mentioned above would help a
> > lot.
> > A plus(?) in it: You can actually sleep while having the rcu_read_lock !!
> 
> This is in some ways similar to the K42 approach to RCU (which they call
> "generations").  Dipankar put together a similar patch for Linux, but
> the problem was that grace periods could be deferred for an extremely
> long time.  Which I suspect is what you were calling out as causing
> RCU batches never to run.
> 

That is where the preempt_by_nonrt_disable/enable() is supposed to help:
then it can't take longer than the normal kernel in the situation where
there are no RT tasks running. RT tasks will prolong the grace periods if
they go into RCU regions, but those are supposed to be relatively small -
and deterministic!

> > > The counter approach might work, and is also what the implementation #5
> > > does -- check out rcu_read_lock() in Ingo's most recent patch.
> > > 
> > 
> > Do you refer to your original mail with implementing it in 5 steps?
> > In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that
> > forces syncronization between the CPUs - bad for scaling. It does make the
> > RCU batches somewhat deterministic, as the RCU task can boost the readers
> > to the rcu-task's priority.
> > The problem about this approach is that everybody calling into RCU code
> > have a worst case behaviour of the systemwide worst case RCU reader 
> > section - which can be pretty large (in principle infinite if somebody.)
> > So if somebody uses a call to a function in the code containing a RCU read
> > area the worst case behavious would be the same as teh worst case latency
> > in the simple world where preempt_disable()/preempt_enable() was used.
> 
> I missed something here -- readers would see the worst-case -writer-
> latency rather than the worst-case -reader- latency, right?  Or are you
> concerned about the case where some reader blocks the write-side
> acquisitions in _synchronize_kernel(), but not before the writer has
> grabbed a lock, blocking any readers on the corresponding CPU?
> 

I am concerned that readers block each other, too. You do need an rw-mutex
allowing an unlimited number of readers for doing this. With the current
rw-mutex the readers block each other, i.e. the worst case latency is the
worst case reader latency - globally!
On the other hand, with an rw-lock being unlimited - and thus not keeping
track of its readers - the readers can't be boosted by the writer. Then you
are back to square 1: the grace period can take a very long time.

> Yes, but this is true of every other lock in the system as well, not?

Other locks are not globally used but only used for a specific subsystem.
On a real-time system you are supposed to know which subsystems you can
call into and still have a low enough latency, as each subsystem has its
own bound. But with a global RCU locking mechanism, all RCU-using code is
to be regarded as _one_ such subsystem.

> In a separate conversation a few weeks ago, Steve Hemminger suggested
> placing checks in long RCU read-side critical sections -- this could
> be used to keep the worst-case reader latency within the desired bounds.
> Not pretty, but, then again, bounding lock latency will require a fair
> number of similar changes, I suspect.
> 
>   Thanx, Paul
> 

Esben


Re: Real-Time Preemption and RCU

2005-03-22 Thread Ingo Molnar

* Kyle Moffett <[EMAIL PROTECTED]> wrote:

> One solution I can think of, although it bloats memory usage for
> many-way boxen, is to just have a table in the rwlock with one entry
> per cpu.  Each CPU would get one concurrent reader, others would need
> to sleep

yes, it bloats memory usage, and complicates PI quite significantly. 
We've been there and have done something similar to that in earlier -RT
patches - 64 'owner' entries in the lock already caused problems on 8K
stacks. 32 seemed to work but caused quite some bloat in data structure
sizes.

i'd rather live with some scalability bottleneck for now, in exchange for
a vastly easier to handle design. We can still complicate things later
on for better scalability, once all the other issues have been sorted
out.

Ingo


Re: Real-Time Preemption and RCU

2005-03-22 Thread hui
On Tue, Mar 22, 2005 at 02:17:27AM -0800, Bill Huey wrote:
> > A notion of priority across a quiescience operation is crazy anyways[-,-] so
> > it would be safe just to use to the old rwlock-semaphore in place without
> > any changes or priorty handling add[i]tions. The RCU algorithm is only 
> > concerned
> > with what is basically a coarse data guard and it isn't time or priority
> > critical.
> 
> A little jitter in a quiescence operation isn't going to hurt things right ?. 

The only thing that I can think of that can go wrong here is what kind
of effect it would have on the thread write-blocking against a bunch of
RCU readers. It could introduce a chain of delays into, say, a timer event
and might cause problems/side-effects for other things being processed.
RCU processing might have to be decoupled and processed by a different
thread to avoid some of that latency weirdness.

What do you folks think ?

bill

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Real-Time Preemption and RCU

2005-03-22 Thread Esben Nielsen
On Tue, 22 Mar 2005, Bill Huey wrote:

> On Fri, Mar 18, 2005 at 05:55:44PM +0100, Esben Nielsen wrote:
> > On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > > i really have no intention to allow multiple readers for rt-mutexes. We
> > > got away with that so far, and i'd like to keep it so. Imagine 100
> > > threads all blocked in the same critical section (holding the read-lock)
> > > when a highprio writer thread comes around: instant 100x latency to let
> > > all of them roll forward. The only sane solution is to not allow
> > > excessive concurrency. (That limits SMP scalability, but there's no
> > > other choice i can see.)
> > 
> > Unless a design change is made: One could argue for a semantics where
> > write-locking _isn't_ deterministic and thus do not have to boost all the
> 
> RCU isn't write deterministic like typical RT apps are we can... (below :-))

It is: it takes place right away. But it is non-deterministic when
_all_ readers actually read it. Also the cleanup is non-deterministic.
So unless you actually _wait_ for the cleanup to happen instead of
deferring it, you can safely do RCU writes in an RT task.

 
> > readers. Readers boost the writers but not the other way around. Readers
> > will be deterministic, but not writers.
> > Such a semantics would probably work for a lot of RT applications
> > happening not to take any write-locks - these will in fact perform better. 
> > But it will give the rest a lot of problems.
> 
> Just came up with an idea after I thought about how much of a bitch it
> would be to get a fast RCU multipule reader semantic (our current shared-
> exclusive lock inserts owners into a sorted priority list per-thread which
> makes it very expensive for a simple RCU case since they are typically very
> small batches of items being altered). Basically the RCU algorithm has *no*
> notion of writer priority and to propagate a PI operation down all reader
> is meaningless, so why not revert back to the original rwlock-semaphore to
> get the multipule reader semantics ?

Remember to boost the writer so that RT tasks can enter read regions. I must
also warn against the dangers: a lot of code where a write-lock is taken
needs to be marked as non-deterministic, i.e. must not be called from
RT tasks (maybe put a WARN_ON(rt_task(current)) in the write-lock
operation?)

 
> A notion of priority across a quiescience operation is crazy anyways, so
> it would be safe just to use to the old rwlock-semaphore in place without
> any changes or priorty handling addtions. The RCU algorithm is only concerned
> with what is basically a coarse data guard and it isn't time or priority
> critical.

I don't find it crazy. I think it is elegant - but also dangerous, as it
might take a long time.

 
> What do you folks think ? That would make Paul's stuff respect multipule
> readers which reduces contention and gets around the problem of possibly
> overloading the current rt lock implementation that we've been bitching
> about. The current RCU development track seem wrong in the first place and
> this seem like it could be a better and more complete solution to the problem.
> 
> If this works, well, you heard it here first. :)
> 
> bill
> 
Esben



Re: Real-Time Preemption and RCU

2005-03-22 Thread Ingo Molnar

* Esben Nielsen [EMAIL PROTECTED] wrote:

> +static inline void rcu_read_lock(void)
> +{
> +	preempt_disable(); 
> +	__get_cpu_var(rcu_data).active_readers++;
> +	preempt_enable();
> +}

this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
the same CPU, and hence ->active_readers can get out of sync.

Ingo


Re: Real-Time Preemption and RCU

2005-03-22 Thread Esben Nielsen
On Tue, 22 Mar 2005, Ingo Molnar wrote:

 
> * Esben Nielsen <[EMAIL PROTECTED]> wrote:
> 
> > +static inline void rcu_read_lock(void)
> > +{  
> > +	preempt_disable(); 
> > +	__get_cpu_var(rcu_data).active_readers++;
> > +	preempt_enable();
> > +}
> 
> this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
> the same CPU, and hence ->active_readers can get out of sync.
> 

Ok, this has to be handled in the migration code somehow. I have already
added a
	current->rcu_read_depth++
so it ought to be painless. A simple solution would be not to migrate
threads with rcu_read_depth != 0.

   Ingo
 

Esben



Re: Real-Time Preemption and RCU

2005-03-22 Thread Ingo Molnar

* Esben Nielsen [EMAIL PROTECTED] wrote:

   +static inline void rcu_read_lock(void)
   +{
   + preempt_disable(); 
   + __get_cpu_var(rcu_data).active_readers++;
   + preempt_enable();
   +}
  
  this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
  the same CPU, and hence ->active_readers can get out of sync.
  
 
 Ok, this has to be handled in the migration code somehow. I have already 
 added a 
   current->rcu_read_depth++
 so it ought to be painless. A simple solution would be not to
 migrate threads with rcu_read_depth != 0.

could you elaborate?

In any case, see the new PREEMPT_RCU code in the -40-07 patch (and
upwards). I've also attached a separate patch, it should apply cleanly
to 2.6.12-rc1.

Ingo
--- linux/kernel/rcupdate.c.orig
+++ linux/kernel/rcupdate.c
@@ -468,3 +468,36 @@ module_param(maxbatch, int, 0);
 EXPORT_SYMBOL(call_rcu);
 EXPORT_SYMBOL(call_rcu_bh);
 EXPORT_SYMBOL(synchronize_kernel);
+
+#ifdef CONFIG_PREEMPT_RCU
+
+void rcu_read_lock(void)
+{
+	if (current->rcu_read_lock_nesting++ == 0) {
+		current->rcu_data = &get_cpu_var(rcu_data);
+		atomic_inc(&current->rcu_data->active_readers);
+		put_cpu_var(rcu_data);
+	}
+}
+EXPORT_SYMBOL(rcu_read_lock);
+
+void rcu_read_unlock(void)
+{
+	int cpu;
+
+	if (--current->rcu_read_lock_nesting == 0) {
+		atomic_dec(&current->rcu_data->active_readers);
+		/*
+		 * Check whether we have reached quiescent state.
+		 * Note! This is only for the local CPU, not for
+		 * current->rcu_data's CPU [which typically is the
+		 * current CPU, but may also be another CPU].
+		 */
+		cpu = get_cpu();
+		rcu_qsctr_inc(cpu);
+		put_cpu();
+	}
+}
+EXPORT_SYMBOL(rcu_read_unlock);
+
+#endif
--- linux/include/linux/rcupdate.h.orig
+++ linux/include/linux/rcupdate.h
@@ -99,6 +99,9 @@ struct rcu_data {
struct rcu_head *donelist;
struct rcu_head **donetail;
int cpu;
+#ifdef CONFIG_PREEMPT_RCU
+	atomic_t	active_readers;
+#endif
 };
 
 DECLARE_PER_CPU(struct rcu_data, rcu_data);
@@ -115,12 +118,18 @@ extern struct rcu_ctrlblk rcu_bh_ctrlblk
 static inline void rcu_qsctr_inc(int cpu)
 {
	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
+#ifdef CONFIG_PREEMPT_RCU
+	if (!atomic_read(&rdp->active_readers))
+#endif
+		rdp->passed_quiesc = 1;
 }
 static inline void rcu_bh_qsctr_inc(int cpu)
 {
	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
+#ifdef CONFIG_PREEMPT_RCU
+	if (!atomic_read(&rdp->active_readers))
+#endif
+		rdp->passed_quiesc = 1;
 }
 
 static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
@@ -183,14 +192,22 @@ static inline int rcu_pending(int cpu)
  *
  * It is illegal to block while in an RCU read-side critical section.
  */
-#define rcu_read_lock()preempt_disable()
+#ifdef CONFIG_PREEMPT_RCU
+  extern void rcu_read_lock(void);
+#else
# define rcu_read_lock() preempt_disable()
+#endif
 
 /**
  * rcu_read_unlock - marks the end of an RCU read-side critical section.
  *
  * See rcu_read_lock() for more information.
  */
-#define rcu_read_unlock()  preempt_enable()
+#ifdef CONFIG_PREEMPT_RCU
+  extern void rcu_read_unlock(void);
+#else
# define rcu_read_unlock() preempt_enable()
+#endif
 
 /*
  * So where is rcu_write_lock()?  It does not exist, as there is no
@@ -213,14 +230,16 @@ static inline int rcu_pending(int cpu)
  * can use just rcu_read_lock().
  *
  */
-#define rcu_read_lock_bh() local_bh_disable()
+//#define rcu_read_lock_bh()   local_bh_disable()
+#define rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); }
 
 /*
  * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
  *
  * See rcu_read_lock_bh() for more information.
  */
-#define rcu_read_unlock_bh()   local_bh_enable()
+//#define rcu_read_unlock_bh() local_bh_enable()
+#define rcu_read_unlock_bh()   { local_bh_enable(); rcu_read_unlock(); }
 
 /**
  * rcu_dereference - fetch an RCU-protected pointer in an
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -727,6 +872,10 @@ struct task_struct {
nodemask_t mems_allowed;
int cpuset_mems_generation;
 #endif
+#ifdef CONFIG_PREEMPT_RCU
+   int rcu_read_lock_nesting;
+   struct rcu_data *rcu_data;
+#endif
 };
 
 static inline pid_t process_group(struct task_struct *tsk)


Re: Real-Time Preemption and RCU

2005-03-22 Thread Esben Nielsen
On Tue, 22 Mar 2005, Ingo Molnar wrote:

 
 * Esben Nielsen [EMAIL PROTECTED] wrote:
 
+static inline void rcu_read_lock(void)
+{  
+   preempt_disable(); 
+   __get_cpu_var(rcu_data).active_readers++;
+   preempt_enable();
+}
   
   this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
   the same CPU, and hence ->active_readers can get out of sync.
   
  
  Ok, this has to be handled in the migration code somehow. I have already 
  added a 
    current->rcu_read_depth++
  so it ought to be painless. A simple solution would be not to
  migrate threads with rcu_read_depth != 0.
 
 could you elaborate?
 
Put an rcu_read_depth on each task. In rcu_read_lock() make a
 current->rcu_read_depth++;
and vice versa in rcu_read_unlock(). In can_migrate_task() add
   if (p->rcu_read_depth)
 return 0;
That might do the trick.
 

 In any case, see the new PREEMPT_RCU code in the -40-07 patch (and
 upwards). I've also attached a separate patch, it should apply cleanly
 to 2.6.12-rc1.
 

I barely have time to download the patches - let alone apply them!

Anyway: I find one thing I don't like: using atomic_inc()/dec() in
rcu_read_lock()/unlock() to touch rcu_data which might be on another
CPU. Then rcu_data is not really per-CPU data anymore and it also hurts
performance of RCU readers.
I think it will be cheaper to use the above rcu_read_depth and then either
not migrate tasks at all or make the migrate code take care of migrating
the rcu_read_depth count to the new CPU - one would have to take care to
increment it in the rcu_data of the new CPU on the new CPU (it isn't
atomic) and then decrement it in the rcu_data of the old CPU on the old
CPU - in that order.


   Ingo
 


Esben



Re: Real-Time Preemption and RCU

2005-03-22 Thread Paul E. McKenney
On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
 On Mon, 21 Mar 2005, Paul E. McKenney wrote:
[ . . . ]
  On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
  This is in some ways similar to the K42 approach to RCU (which they call
  generations).  Dipankar put together a similar patch for Linux, but
  the problem was that grace periods could be deferred for an extremely
  long time.  Which I suspect is what you were calling out as causing
  RCU batches never to run.
 
 That is where the preempt_by_nonrt_disable/enable() is supposed to help:
 Then it can't take longer than the normal kernel in the situation where
 there are no RT tasks running. RT tasks will prolong the grace periods if
 they go into RCU regions, but they are supposed to be relatively small -
 and deterministic!

The part that I am missing is how this helps in the case where a non-RT
task gets preempted in the middle of an RCU read-side critical section
indefinitely.  Or are you boosting the priority of any task that
enters an RCU read-side critical section?

The counter approach might work, and is also what the implementation #5
does -- check out rcu_read_lock() in Ingo's most recent patch.

   
   Do you refer to your original mail with implementing it in 5 steps?
   In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that
   forces synchronization between the CPUs - bad for scaling. It does make the
   RCU batches somewhat deterministic, as the RCU task can boost the readers
   to the rcu-task's priority.
   The problem about this approach is that everybody calling into RCU code
   has a worst case behaviour of the systemwide worst case RCU reader
   section - which can be pretty large (in principle infinite if somebody.)
   So if somebody uses a call to a function in the code containing a RCU read
   area the worst case behaviour would be the same as the worst case latency
   in the simple world where preempt_disable()/preempt_enable() was used.
  
  I missed something here -- readers would see the worst-case -writer-
  latency rather than the worst-case -reader- latency, right?  Or are you
  concerned about the case where some reader blocks the write-side
  acquisitions in _synchronize_kernel(), but not before the writer has
  grabbed a lock, blocking any readers on the corresponding CPU?
 
 I am concerned that readers block each other, too. You do need an rw-mutex
 allowing an unlimited number of readers for doing this. With the current
 rw-mutex the readers block each other. I.e. the worst case latency is the
 worst case reader latency - globally!
 On the other hand, with a rw-lock being unlimited - and thus not keeping
 track of its readers - the readers can't be boosted by the writer. Then you
 are back to square 1: The grace period can take a very long time.

OK, good point.

  Yes, but this is true of every other lock in the system as well, not?
 
 Other locks are not globally used but only used for a specific subsystem.
 On a real-time system you are supposed to know which subsystems you can
 call into and still have a low enough latency, as each subsystem has its
 own bound. But with a global RCU locking mechanism all RCU-using code is
 to be regarded as _one_ such subsystem.

Yep.  As would the things protected by the dcache lock, task list lock,
and so on, right?

Thanx, Paul

  In a separate conversation a few weeks ago, Steve Hemminger suggested
  placing checks in long RCU read-side critical sections -- this could
  be used to keep the worst-case reader latency within the desired bounds.
  Not pretty, but, then again, bounding lock latency will require a fair
  number of similar changes, I suspect.
  
  Thanx, Paul
  
 
 Esben
 
 


Re: Real-Time Preemption and RCU

2005-03-21 Thread Paul E. McKenney
On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> On Sun, 20 Mar 2005, Paul E. McKenney wrote:
> 
> > On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote:
> > > On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > > 
> > > > [...]
> > > 
> > > I think it can be deterministic (on the long timescale of memory 
> > > management) 
> > > anyway: Boost any non-RT task entering an RCU region to the lowest RT 
> > > priority.
> > > This way only all the RT tasks + one non-RT task can be within those
> > > regions. The RT-tasks are supposed to have some kind of upper bound to
> > > their CPU-usage. The non-RT task will also finish "soon" as it is
> > > boosted. If the RCU batches are also at the lowest RT-priority they can be
> > > run immediately after the non-RT task is done.
> > 
> > Hmmm...  Sort of a preemptive-first-strike priority boost.  Cute!  ;-)
> > 
> Well, I was actually thinking of an API like
>  preempt_by_nonrt_disable()
>  preempt_by_nonrt_enable()
> working like the old preempt_disable()/preempt_enable() but still
> allowing RT tasks (as well as priority inheriting non-RT tasks) to be
> scheduled. That would kind of help "split" the kernel into two halves: the
> RT part and the non-RT part. The non-RT part would in many ways work as it
> has always done.

Does sound in some ways similar to the migration approach -- there, the
RT/non-RT split is made across CPUs.  But if RT is allowed to preempt,
then you still have to deal with preemption for locking correctness, right?

> > > > clearly the simplest solution is to go with the single-reader locks for
> > > > now - a separate experiment could be done with a new type of rwlock that
> > > > can only be used by the RCU code. (I'm not quite sure whether we could
> > > > guarantee a minimum rate of RCU callback processing under such a scheme
> > > > though. It's an eventual memory DoS otherwise.)
> > > > 
> > > 
> > > Why is a lock needed at all? If it is doable without locking for a
> > > non-preemptable SMP kernel it must be doable for a preemptable kernel as
> > > well. I am convinced some kind of per-CPU rcu_read_count as I specified in
> > > my previous mail can work some way or the other. call_rcu() might need to
> > > do more complicated stuff and thus use CPU, but call_rcu() is supposed to
> > > be a relatively rare event not worth optimizing for.  Such an
> > > implementation will work for any preemptable kernel, not only PREEMPT_RT. 
> > > Where performance is considered, it is important not to acquire any locks in
> > > the rcu-read regions. 
> > 
> > You definitely don't need a lock -- you can just suppress preemption
> > on the read side instead.  But that potentially makes for long scheduling
> > latencies.
> 
> Well, in my patch I do not leave preemption off - only while doing the
> simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader
> might be active, i.e. preempted, on the current CPU such that this isn't
> a quiescent point after all.
> (To others: Paul nicely unfolded my attachment below - I left it in
> the mail such that you can read it.)
> The problem with this approach is of course that user space programs might
> preempt an RCU reader for a very long time such that RCU batches are never
> really run. The boosting of non-RT tasks mentioned above would help a
> lot.
> A plus(?) in it: You can actually sleep while having the rcu_read_lock !!

This is in some ways similar to the K42 approach to RCU (which they call
"generations").  Dipankar put together a similar patch for Linux, but
the problem was that grace periods could be deferred for an extremely
long time.  Which I suspect is what you were calling out as causing
RCU batches never to run.

> > The counter approach might work, and is also what the implementation #5
> > does -- check out rcu_read_lock() in Ingo's most recent patch.
> > 
> 
> Do you refer to your original mail with implementing it in 5 steps?
> In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that
> forces synchronization between the CPUs - bad for scaling. It does make the
> RCU batches somewhat deterministic, as the RCU task can boost the readers
> to the rcu-task's priority.
> The problem about this approach is that everybody calling into RCU code
> has a worst case behaviour of the systemwide worst case RCU reader 
> section - which can be pretty large (in principle infinite if somebody.)
> So if somebody uses a call to a function in the code containing a RCU read
> area the worst case behaviour would be the same as the worst case latency
> in the simple world where preempt_disable()/preempt_enable() was used.

I missed something here -- readers would see the worst-case -writer-
latency rather than the worst-case -reader- latency, right?  Or are you
concerned about the case where some reader blocks the write-side
acquisitions in _synchronize_kernel(), but not before the writer has
grabbed a lock, blocking any readers on the corresponding CPU?

Yes, but this is 

Re: Real-Time Preemption and RCU

2005-03-20 Thread Esben Nielsen
On Sun, 20 Mar 2005, Paul E. McKenney wrote:

> On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote:
> > On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > 
> > > [...]
> > 
> > I think it can be deterministic (on the long timescale of memory 
> > management) 
> > anyway: Boost any non-RT task entering an RCU region to the lowest RT 
> > priority.
> > This way only all the RT tasks + one non-RT task can be within those
> > regions. The RT-tasks are supposed to have some kind of upper bound to
> > their CPU-usage. The non-RT task will also finish "soon" as it is
> > boosted. If the RCU batches are also at the lowest RT-priority they can be
> > run immediately after the non-RT task is done.
> 
> Hmmm...  Sort of a preemptive-first-strike priority boost.  Cute!  ;-)
> 
Well, I was actually thinking of an API like
 preempt_by_nonrt_disable()
 preempt_by_nonrt_enable()
working like the old preempt_disable()/preempt_enable() but still
allowing RT tasks (as well as priority inheriting non-RT tasks) to be
scheduled. That would kind of help "split" the kernel into two halves: the
RT part and the non-RT part. The non-RT part would in many ways work as it
has always done.

> > > clearly the simplest solution is to go with the single-reader locks for
> > > now - a separate experiment could be done with a new type of rwlock that
> > > can only be used by the RCU code. (I'm not quite sure whether we could
> > > guarantee a minimum rate of RCU callback processing under such a scheme
> > > though. It's an eventual memory DoS otherwise.)
> > > 
> > 
> > Why is a lock needed at all? If it is doable without locking for a
> > non-preemptable SMP kernel it must be doable for a preemptable kernel as
> > well. I am convinced some kind of per-CPU rcu_read_count as I specified in
> > my previous mail can work some way or the other. call_rcu() might need to
> > do more complicated stuff and thus use CPU, but call_rcu() is supposed to
> > be a relatively rare event not worth optimizing for.  Such an
> > implementation will work for any preemptable kernel, not only PREEMPT_RT. 
> > Where performance is considered, it is important not to acquire any locks in
> > the rcu-read regions. 
> 
> You definitely don't need a lock -- you can just suppress preemption
> on the read side instead.  But that potentially makes for long scheduling
> latencies.

Well, in my patch I do not leave preemption off - only while doing the
simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader
might be active, i.e. preempted, on the current CPU such that this isn't
a quiescent point after all.
(To others: Paul nicely unfolded my attachment below - I left it in
the mail such that you can read it.)
The problem with this approach is of course that user space programs might
preempt an RCU reader for a very long time such that RCU batches are never
really run. The boosting of non-RT tasks mentioned above would help a
lot.
A plus(?) in it: You can actually sleep while having the rcu_read_lock !!

> 
> The counter approach might work, and is also what the implementation #5
> does -- check out rcu_read_lock() in Ingo's most recent patch.
> 

Do you refer to your original mail with implementing it in 5 steps?
In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that
forces synchronization between the CPUs - bad for scaling. It does make the
RCU batches somewhat deterministic, as the RCU task can boost the readers
to the rcu-task's priority.
The problem about this approach is that everybody calling into RCU code
has a worst case behaviour of the systemwide worst case RCU reader 
section - which can be pretty large (in principle infinite if somebody.)
So if somebody uses a call to a function in the code containing a RCU read
area the worst case behaviour would be the same as the worst case latency
in the simple world where preempt_disable()/preempt_enable() was used.

>   Thanx, Paul
> 
> > I tried this approach. My UP laptop did boot on it, but I haven't tested
> > it further. I have included the very small patch as an attachment.
> > 
> > >   Ingo
> > 
> > I have not yet looked at -V0.7.41-00...
> > 
> > Esben
> > 
> 
> > diff -Naur --exclude-from diff_exclude 
> > linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h 
> > linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
> > --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h  2005-03-11 
> > 23:40:13.0 +0100
> > +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h  
> > 2005-03-19 12:47:09.0 +0100
> > @@ -85,6 +85,7 @@
> >   * curlist - current batch for which quiescent cycle started if any
> >   */
> >  struct rcu_data {
> > +	long	active_readers;
> > /* 1) quiescent state handling : */
> > 	long	quiescbatch; /* Batch # for grace period */
> > int passed_quiesc;   /* User-mode/idle loop etc. */
> > @@ -115,12 +116,14 @@
> >  static inline void 

Re: Real-Time Preemption and RCU

2005-03-20 Thread Paul E. McKenney
On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote:
> On Fri, 18 Mar 2005, Ingo Molnar wrote:
> 
> > 
> > * Esben Nielsen <[EMAIL PROTECTED]> wrote:
> > 
> > > Why should there only be one RCU-reader per CPU at each given
> > > instance? Even on a real-time UP system it would be very helpful to
> > > have RCU areas be enterable by several tasks at once. It would
> > > perform better, both wrt. latencies and throughput: With the above
> > > implementation a high priority task entering an RCU area will have to
> > > boost the current RCU reader, make a task switch until that one
> > > finishes and make yet another task switch to get back to the high
> > > priority task. With an RCU implementation which can take n RCU readers
> > > per CPU there is no such problem.
> > 
> > correct, for RCU we could allow multiple readers per lock, because the
> > 'blocking' side of RCU (callback processing) is never (supposed to be)
> > in any latency path.
> > 
> > except if someone wants to make RCU callback processing deterministic at
> > some point. (e.g. for memory management reasons.)
> 
> I think it can be deterministic (on the long timescale of memory management) 
> anyway: Boost any non-RT task entering an RCU region to the lowest RT 
> priority.
> This way only all the RT tasks + one non-RT task can be within those
> regions. The RT-tasks are supposed to have some kind of upper bound to
> their CPU-usage. The non-RT task will also finish "soon" as it is
> boosted. If the RCU batches are also at the lowest RT-priority they can be
> run immediately after the non-RT task is done.

Hmmm...  Sort of a preemptive-first-strike priority boost.  Cute!  ;-)

> > clearly the simplest solution is to go with the single-reader locks for
> > now - a separate experiment could be done with a new type of rwlock that
> > can only be used by the RCU code. (I'm not quite sure whether we could
> > guarantee a minimum rate of RCU callback processing under such a scheme
> > though. It's an eventual memory DoS otherwise.)
> > 
> 
> Why is a lock needed at all? If it is doable without locking for a
> non-preemptable SMP kernel it must be doable for a preemptable kernel as
> well. I am convinced some kind of per-CPU rcu_read_count as I specified in
> my previous mail can work some way or the other. call_rcu() might need to
> do more complicated stuff and thus use CPU, but call_rcu() is supposed to
> be a relatively rare event not worth optimizing for.  Such an
> implementation will work for any preemptable kernel, not only PREEMPT_RT. 
> Where performance is considered, it is important not to acquire any locks in
> the rcu-read regions. 

You definitely don't need a lock -- you can just suppress preemption
on the read side instead.  But that potentially makes for long scheduling
latencies.

The counter approach might work, and is also what the implementation #5
does -- check out rcu_read_lock() in Ingo's most recent patch.

Thanx, Paul

> I tried this approach. My UP laptop did boot on it, but I haven't tested
> it further. I have included the very small patch as an attachment.
> 
> > Ingo
> 
> I have not yet looked at -V0.7.41-00...
> 
> Esben
> 

> diff -Naur --exclude-from diff_exclude 
> linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h 
> linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
> --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h2005-03-11 
> 23:40:13.0 +0100
> +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
> 2005-03-19 12:47:09.0 +0100
> @@ -85,6 +85,7 @@
>   * curlist - current batch for which quiescent cycle started if any
>   */
>  struct rcu_data {
> +	long	active_readers;
>   /* 1) quiescent state handling : */
> 	long	quiescbatch; /* Batch # for grace period */
>   int passed_quiesc;   /* User-mode/idle loop etc. */
> @@ -115,12 +116,14 @@
>  static inline void rcu_qsctr_inc(int cpu)
>  {
> 	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> - rdp->passed_quiesc = 1;
> + if(rdp->active_readers==0)
> + rdp->passed_quiesc = 1;
>  }
>  static inline void rcu_bh_qsctr_inc(int cpu)
>  {
> 	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
> - rdp->passed_quiesc = 1;
> + if(rdp->active_readers==0)
> + rdp->passed_quiesc = 1;
>  }
>  
>  static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
> @@ -183,29 +186,27 @@
>   *
>   * It is illegal to block while in an RCU read-side critical section.
>   */
> -#define rcu_read_lock()  preempt_disable()
> +static inline void rcu_read_lock(void)
> +{
> + preempt_disable(); 
> + __get_cpu_var(rcu_data).active_readers++;
> + preempt_enable();
> +}
>  
>  /**
>   * rcu_read_unlock - marks the end of an RCU read-side critical section.
>   *
>   * See rcu_read_lock() for more information.
>   */
> -#define rcu_read_unlock()

Re: Real-Time Preemption and RCU

2005-03-20 Thread hui
On Sun, Mar 20, 2005 at 01:38:24PM -0800, Bill Huey wrote:
> On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote:
> > That was just one random example.
> > Another one would be :
> > 
> > drivers/char/tty_io.c, __do_SAK() contains
> >read_lock(&tasklist_lock);
> >task_lock(p);
> > 
> > kernel/sys.c, sys_setrlimit contains
> >task_lock(current->group_leader);
> >read_lock(&tasklist_lock);
> > 
> > task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is 
> > a normal spinlock, then this is an A/B B/A deadlock.
> 
> That code was already dubious in the first place just because it
> contained that circularity. If you had a rwlock that blocks on an
> upper read count maximum[,] a deadlock situation would trigger anyways,
> say, upon a flood of threads trying to do that sequence of acquires.

The RT patch uses the lock ordering "in place" and whatever nasty
situation was going on previously will effectively be under high load,
which increases the chance of it being triggered. Removal of the read
side semantic just increases load more so that those cases can trigger.

I disagree with this approach and I have an alternate implementation
here that restores it. It's only half tested and fairly meaningless
until an extreme contention case is revealed with the current rt lock
implementation. Numbers need to be gathered to prove or disprove this
conjecture.

> I'd probably experiment with using the {spin,read,write}-trylock
> logic and release all the locks contained in a sequence like that
> on the failure to acquire any of the locks in the chain as an
> initial fix. A longer term fix might be to break things up a bit
> so that whatever ordering being done would have that circularity.

Excuse me, ...would *not* have that circularity.

bill

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Real-Time Preemption and RCU

2005-03-20 Thread hui
On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote:
> That was just one random example.
> Another one would be :
> 
> drivers/char/tty_io.c, __do_SAK() contains
>read_lock(&tasklist_lock);
>task_lock(p);
> 
> kernel/sys.c, sys_setrlimit contains
>task_lock(current->group_leader);
>read_lock(&tasklist_lock);
> 
> task_lock() is shorthand for spin_lock(&p->alloc_lock). If read_lock is 
> a normal spinlock, then this is an A/B B/A deadlock.

That code was already dubious in the first place just because it
contained that circularity. If you had an rwlock that blocks at an
upper read-count maximum, a deadlock would trigger anyway,
say, upon a flood of threads trying to do that sequence of acquires.

I'd probably experiment with using the {spin,read,write}-trylock
logic and, as an initial fix, release all the locks acquired in a
sequence like that upon failure to acquire any lock in the chain.
A longer-term fix might be to break things up a bit so that
whatever ordering is being done would have that circularity.

BTW, the runtime lock circularity detector was designed to trigger
on that situation anyway.

That's my thoughts on the matter.

bill



Re: Real-Time Preemption and RCU

2005-03-20 Thread Manfred Spraul
Thomas Gleixner wrote:
> On Sun, 2005-03-20 at 07:36 +0100, Manfred Spraul wrote:
> > cpu 1:
> > acquire random networking spin_lock_bh()
> >
> > cpu 2:
> > read_lock(&tasklist_lock) from process context
> > interrupt. softirq. within softirq: try to acquire the networking lock.
> > * spins.
> >
> > cpu 1:
> > hardware interrupt
> > within hw interrupt: signal delivery. tries to acquire tasklist_lock.
> >
> > --> deadlock.
>
> Signal delivery from hw interrupt context (interrupt is flagged
> SA_NODELAY) is not possible in RT preemption mode. The
> local_irq_save_nort() check in __cache_alloc will catch you.

That was just one random example.
Another one would be:

drivers/char/tty_io.c, __do_SAK() contains
   read_lock(&tasklist_lock);
   task_lock(p);

kernel/sys.c, sys_setrlimit contains
   task_lock(current->group_leader);
   read_lock(&tasklist_lock);

task_lock() is shorthand for spin_lock(&p->alloc_lock). If read_lock is
a normal spinlock, then this is an A/B B/A deadlock.

--
   Manfred


Re: Real-Time Preemption and RCU

2005-03-20 Thread Esben Nielsen
On Fri, 18 Mar 2005, Ingo Molnar wrote:

> 
> * Esben Nielsen <[EMAIL PROTECTED]> wrote:
> 
> > Why should there be only one RCU reader per CPU at any given
> > instant? Even on a real-time UP system it would be very helpful to
> > have RCU areas be enterable by several tasks at once. It would
> > perform better, both wrt. latencies and throughput: With the above
> > implementation a high-priority task entering an RCU area will have to
> > boost the current RCU reader, make a task switch until that one
> > finishes, and then make yet another task switch to get back to the
> > high-priority task. With an RCU implementation which can take n RCU
> > readers per CPU there is no such problem.
> 
> correct, for RCU we could allow multiple readers per lock, because the
> 'blocking' side of RCU (callback processing) is never (supposed to be)
> in any latency path.
> 
> except if someone wants to make RCU callback processing deterministic at
> some point. (e.g. for memory management reasons.)

I think it can be deterministic (on the long timescale of memory management)
anyway: Boost any non-RT task entering an RCU region to the lowest RT priority.
This way, only the RT tasks plus one non-RT task can be within those
regions. The RT tasks are supposed to have some kind of upper bound on
their CPU usage. The non-RT task will also finish "soon", as it is
boosted. If the RCU batches also run at the lowest RT priority, they can be
processed immediately after the non-RT task is done.

> 
> clearly the simplest solution is to go with the single-reader locks for
> now - a separate experiment could be done with a new type of rwlock that
> can only be used by the RCU code. (I'm not quite sure whether we could
> guarantee a minimum rate of RCU callback processing under such a scheme
> though. It's an eventual memory DoS otherwise.)
> 

Why is a lock needed at all? If it is doable without locking for a
non-preemptable SMP kernel, it must be doable for a preemptable kernel as
well. I am convinced some kind of per-CPU rcu_read_count, as I specified in
my previous mail, can work some way or the other. call_rcu() might need to
do more complicated stuff and thus use CPU, but call_rcu() is supposed to
be a relatively rare event, not worth optimizing for. Such an
implementation will work for any preemptable kernel, not only PREEMPT_RT.
Where performance is concerned, it is important not to acquire any locks in
the rcu-read regions.

I tried this approach. My UP laptop did boot with it, but I haven't tested
it further. I have included the very small patch as an attachment.

>   Ingo

I have not yet looked at -V0.7.41-00...

Esben

diff -Naur --exclude-from diff_exclude 
linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h 
linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
--- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h  2005-03-11 
23:40:13.0 +0100
+++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h  2005-03-19 
12:47:09.0 +0100
@@ -85,6 +85,7 @@
  * curlist - current batch for which quiescent cycle started if any
  */
 struct rcu_data {
+	long	active_readers;
 	/* 1) quiescent state handling : */
 	long	quiescbatch;	/* Batch # for grace period */
 	int	passed_quiesc;	/* User-mode/idle loop etc. */
@@ -115,12 +116,14 @@
 static inline void rcu_qsctr_inc(int cpu)
 {
	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-   rdp->passed_quiesc = 1;
+   if(rdp->active_readers==0)
+   rdp->passed_quiesc = 1;
 }
 static inline void rcu_bh_qsctr_inc(int cpu)
 {
	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-   rdp->passed_quiesc = 1;
+   if(rdp->active_readers==0)
+   rdp->passed_quiesc = 1;
 }
 
 static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
@@ -183,29 +186,27 @@
  *
  * It is illegal to block while in an RCU read-side critical section.
  */
-#define rcu_read_lock()	preempt_disable()
+static inline void rcu_read_lock(void)
+{  
+   preempt_disable(); 
+   __get_cpu_var(rcu_data).active_readers++;
+   preempt_enable();
+}
 
 /**
  * rcu_read_unlock - marks the end of an RCU read-side critical section.
  *
  * See rcu_read_lock() for more information.
  */
-#define rcu_read_unlock()  preempt_enable()
+static inline void rcu_read_unlock(void)
+{  
+   preempt_disable(); 
+   __get_cpu_var(rcu_data).active_readers--;
+   preempt_enable();
+}
 
 #define IGNORE_LOCK(op, lock)  do { (void)(lock); op(); } while (0)
 
-#ifdef CONFIG_PREEMPT_RT
-# define rcu_read_lock_spin(lock)	spin_lock(lock)
-# define rcu_read_unlock_spin(lock)	spin_unlock(lock)
-# define rcu_read_lock_read(lock)	read_lock(lock)
-# define rcu_read_unlock_read(lock)	read_unlock(lock)
-# define rcu_read_lock_bh_read(lock)	read_lock_bh(lock)
-# define rcu_read_unlock_bh_read(lock)	read_unlock_bh(lock)
-# define rcu_read_lock_down_read(rwsem)	down_read(rwsem)

Re: Real-Time Preemption and RCU

2005-03-20 Thread Thomas Gleixner
On Sun, 2005-03-20 at 07:36 +0100, Manfred Spraul wrote:
> cpu 1:
> acquire random networking spin_lock_bh()
> 
> cpu 2:
> read_lock(&tasklist_lock) from process context
> interrupt. softirq. within softirq: try to acquire the networking lock.
> * spins.
> 
> cpu 1:
> hardware interrupt
> within hw interrupt: signal delivery. tries to acquire tasklist_lock.
> 
> --> deadlock.

Signal delivery from hw interrupt context (interrupt is flagged
SA_NODELAY) is not possible in RT preemption mode. The
local_irq_save_nort() check in __cache_alloc will catch you.

When it happens from a threaded irq handler the situation is solvable by
the PI code.

tglx




Re: Real-Time Preemption and RCU

2005-03-20 Thread Kyle Moffett
On Mar 19, 2005, at 11:31, Ingo Molnar wrote:
> > What about allowing only as many concurrent readers as there are CPUs?
>
> since a reader may be preempted by a higher prio task, there is no
> linear relationship between CPU utilization and the number of readers
> allowed. You could easily end up having all the nr_cpus readers
> preempted on one CPU. It gets pretty messy.

One solution I can think of, although it bloats memory usage for many-way
boxen, is to just have a table in the rwlock with one entry per cpu.  Each
CPU would get one concurrent reader; others would need to sleep.

Cheers,
Kyle Moffett
-BEGIN GEEK CODE BLOCK-
Version: 3.12
GCM/CS/IT/U d- s++: a18 C>$ UB/L/X/*(+)>$ P+++()>$
L(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b(++) DI+ D+ G e->$ h!*()>++$ r  
!y?(-)
--END GEEK CODE BLOCK--



Re: Real-Time Preemption and RCU

2005-03-20 Thread Paul E. McKenney
On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote:
 On Fri, 18 Mar 2005, Ingo Molnar wrote:
 
  
  * Esben Nielsen [EMAIL PROTECTED] wrote:
  
   Why can should there only be one RCU-reader per CPU at each given
   instance? Even on a real-time UP system it would be very helpfull to
   have RCU areas to be enterable by several tasks as once. It would
   perform better, both wrt. latencies and throughput: With the above
   implementation an high priority task entering an RCU area will have to
   boost the current RCU reader, make a task switch until that one
   finishes and makes yet another task switch. to get back to the high
   priority task. With an RCU implementation which can take n RCU readers
   per CPU there is no such problem.
  
  correct, for RCU we could allow multiple readers per lock, because the
  'blocking' side of RCU (callback processing) is never (supposed to be)
  in any latency path.
  
  except if someone wants to make RCU callback processing deterministic at
  some point. (e.g. for memory management reasons.)
 
 I think it can be deterministic (on the long timescale of memory management) 
 anyway: Boost any non-RT task entering an RCU region to the lowest RT 
 priority.
 This way only all the RT tasks + one non-RT task can be within those
 regions. The RT-tasks are supposed to have some kind of upper bound to
 their CPU-usage. The non-RT task will also finish soon as it is
 boosted. If the RCU batches are also at the lowest RT-priority they can be
 run immediately after the non-RT task is done.

Hmmm...  Sort of a preemptive-first-strike priority boost.  Cute!  ;-)

  clearly the simplest solution is to go with the single-reader locks for
  now - a separate experiment could be done with a new type of rwlock that
  can only be used by the RCU code. (I'm not quite sure whether we could
  guarantee a minimum rate of RCU callback processing under such a scheme
  though. It's an eventual memory DoS otherwise.)
  
 
 Why are a lock needed at all? If it is doable without locking for an
 non-preemptable SMP kernel it must be doable for an preemptable kernel as
 well.I am convinced some kind of per-CPU rcu_read_count as I specified in
 my previous mail can work some way or the other. call_rcu() might need to
 do more complicated stuff and thus use CPU but call_rcu() is supposed to
 be an relative rare event not worth optimizing for.  Such an
 implementation will work for any preemptable kernel, not only PREEMPT_RT. 
 For performance is considered it is important not to acquire any locks in
 the rcu-read regions. 

You definitely don't need a lock -- you can just suppress preemption
on the read side instead.  But that potentially makes for long scheduling
latencies.

The counter approach might work, and is also what the implementation #5
does -- check out rcu_read_lock() in Ingo's most recent patch.

Thanx, Paul

 I tried this approach. My UP labtop did boot on it, but I haven't testet
 it further. I have included the very small patch as an attachment.
 
  Ingo
 
 I have not yet looked at -V0.7.41-00...
 
 Esben
 

 diff -Naur --exclude-from diff_exclude 
 linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h 
 linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
 --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h2005-03-11 
 23:40:13.0 +0100
 +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
 2005-03-19 12:47:09.0 +0100
 @@ -85,6 +85,7 @@
   * curlist - current batch for which quiescent cycle started if any
   */
  struct rcu_data {
 +	long	active_readers;
  	/* 1) quiescent state handling : */
  	long	quiescbatch;	/* Batch # for grace period */
  	int	passed_quiesc;	/* User-mode/idle loop etc. */
 @@ -115,12 +116,14 @@
  static inline void rcu_qsctr_inc(int cpu)
  {
 	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 -	rdp->passed_quiesc = 1;
 +	if (rdp->active_readers == 0)
 +		rdp->passed_quiesc = 1;
  }
  static inline void rcu_bh_qsctr_inc(int cpu)
  {
 	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
 -	rdp->passed_quiesc = 1;
 +	if (rdp->active_readers == 0)
 +		rdp->passed_quiesc = 1;
  }
  
  static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
 @@ -183,29 +186,27 @@
   *
   * It is illegal to block while in an RCU read-side critical section.
   */
 -#define rcu_read_lock()  preempt_disable()
 +static inline void rcu_read_lock(void)
 +{
 + preempt_disable(); 
 + __get_cpu_var(rcu_data).active_readers++;
 + preempt_enable();
 +}
  
  /**
   * rcu_read_unlock - marks the end of an RCU read-side critical section.
   *
   * See rcu_read_lock() for more information.
   */
 -#define rcu_read_unlock()	preempt_enable()
 +static inline void rcu_read_unlock(void)
 +{
 + preempt_disable(); 
 + __get_cpu_var(rcu_data).active_readers--;
 +   

Re: Real-Time Preemption and RCU

2005-03-20 Thread Esben Nielsen
On Sun, 20 Mar 2005, Paul E. McKenney wrote:

 On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote:
  On Fri, 18 Mar 2005, Ingo Molnar wrote:
  
   [...]
  
  I think it can be deterministic (on the long timescale of memory 
  management) 
  anyway: Boost any non-RT task entering an RCU region to the lowest RT 
  priority.
  This way only all the RT tasks + one non-RT task can be within those
  regions. The RT-tasks are supposed to have some kind of upper bound to
  their CPU-usage. The non-RT task will also finish soon as it is
  boosted. If the RCU batches are also at the lowest RT-priority they can be
  run immediately after the non-RT task is done.
 
 Hmmm...  Sort of a preemptive-first-strike priority boost.  Cute!  ;-)
 
Well, I was actually thinking of an API like
 preempt_by_nonrt_disable()
 preempt_by_nonrt_enable()
working like the old preempt_disable()/preempt_enable() but still
allowing RT tasks (as well as priority inheriting non-RT tasks) to be
scheduled. That would kind of help split the kernel into two halves: the
RT part and the non-RT part. The non-RT part would in many ways work as it
has always done.

   clearly the simplest solution is to go with the single-reader locks for
   now - a separate experiment could be done with a new type of rwlock that
   can only be used by the RCU code. (I'm not quite sure whether we could
   guarantee a minimum rate of RCU callback processing under such a scheme
   though. It's an eventual memory DoS otherwise.)
   
  
  Why are a lock needed at all? If it is doable without locking for an
  non-preemptable SMP kernel it must be doable for an preemptable kernel as
  well.I am convinced some kind of per-CPU rcu_read_count as I specified in
  my previous mail can work some way or the other. call_rcu() might need to
  do more complicated stuff and thus use CPU but call_rcu() is supposed to
  be an relative rare event not worth optimizing for.  Such an
  implementation will work for any preemptable kernel, not only PREEMPT_RT. 
  For performance is considered it is important not to acquire any locks in
  the rcu-read regions. 
 
 You definitely don't need a lock -- you can just suppress preemption
 on the read side instead.  But that potentially makes for long scheduling
 latencies.

Well, in my patch I do not leave preemption off - only while doing the
simple ++/--. In effect, I let rcu_qsctr_inc() know that some RCU reader
might be active, i.e. preempted, on the current CPU, such that this isn't
a quiescent point after all.
(To others: Paul nicely unfolded my attachment below - I left it in
the mail so that you can read it.)
The problem with this approach is of course that user-space programs might
preempt an RCU reader for a very long time, such that RCU batches are never
really run. The boosting of non-RT tasks mentioned above would help a
lot.
A plus(?) in it: you can actually sleep while holding the rcu_read_lock!

 
 The counter approach might work, and is also what the implementation #5
 does -- check out rcu_read_lock() in Ingo's most recent patch.
 

Do you refer to your original mail implementing it in 5 steps?
In #5 in that one (-V0.7.41-00, right?) you use a lock, and as you say, that
forces synchronization between the CPUs - bad for scaling. It does make the
RCU batches somewhat deterministic, as the RCU task can boost the readers
to the rcu-task's priority.
The problem with this approach is that everybody calling into RCU code
has a worst-case behaviour of the system-wide worst-case RCU reader
section - which can be pretty large (in principle infinite if somebody.)
So if somebody calls a function in code containing an RCU read
area, the worst-case behaviour would be the same as the worst-case latency
in the simple world where preempt_disable()/preempt_enable() was used.

   Thanx, Paul
 
  I tried this approach. My UP labtop did boot on it, but I haven't testet
  it further. I have included the very small patch as an attachment.
  
 Ingo
  
  I have not yet looked at -V0.7.41-00...
  
  Esben
  
 
  diff -Naur --exclude-from diff_exclude 
  linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h 
  linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
  --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h  2005-03-11 
  23:40:13.0 +0100
  +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h  
  2005-03-19 12:47:09.0 +0100
  @@ -85,6 +85,7 @@
* curlist - current batch for which quiescent cycle started if any
*/
   struct rcu_data {
  +	long	active_readers;
  	/* 1) quiescent state handling : */
  	long	quiescbatch;	/* Batch # for grace period */
  	int	passed_quiesc;	/* User-mode/idle loop etc. */
  @@ -115,12 +116,14 @@
   static inline void rcu_qsctr_inc(int cpu)
   {
  	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
  -	rdp->passed_quiesc = 1;
  +   

Re: Real-Time Preemption and RCU

2005-03-19 Thread Manfred Spraul
Ingo Molnar wrote:
> which precise locking situation do you mean?

cpu 1:
acquire random networking spin_lock_bh()

cpu 2:
read_lock(&tasklist_lock) from process context
interrupt. softirq. within softirq: try to acquire the networking lock.
* spins.

cpu 1:
hardware interrupt
within hw interrupt: signal delivery. tries to acquire tasklist_lock.

--> deadlock.
--
   Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Real-Time Preemption and RCU

2005-03-19 Thread Ingo Molnar

* Herbert Xu <[EMAIL PROTECTED]> wrote:

> Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >
> > i really have no intention to allow multiple readers for rt-mutexes. We
> > got away with that so far, and i'd like to keep it so. Imagine 100
> > threads all blocked in the same critical section (holding the read-lock)
> > when a highprio writer thread comes around: instant 100x latency to let
> > all of them roll forward. The only sane solution is to not allow
> > excessive concurrency. (That limits SMP scalability, but there's no
> > other choice i can see.)
> 
> What about allowing only as many concurrent readers as there are CPUs?

since a reader may be preempted by a higher prio task, there is no
linear relationship between CPU utilization and the number of readers
allowed. You could easily end up having all the nr_cpus readers
preempted on one CPU. It gets pretty messy.

Ingo


Re: Real-Time Preemption and RCU

2005-03-19 Thread Ingo Molnar

* Manfred Spraul <[EMAIL PROTECTED]> wrote:

> Ingo Molnar wrote:
> 
> > read_lock();
> > ...
> > read_lock();
> >
> >are still legal. (it's also done quite often.)
> >
> > 
> >
>
> How do you handle the write_lock_irq()/read_lock locks? E.g. the
> tasklist_lock or the fasync_lock.

which precise locking situation do you mean?

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Manfred Spraul
Ingo Molnar wrote:
read_lock();
...
read_lock();
are still legal. (it's also done quite often.)
 

How do you handle the write_lock_irq()/read_lock locks?
E.g. the tasklist_lock or the fasync_lock.
--
   Manfred


Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 02:22:30PM -0800, Paul E. McKenney wrote:
> On Fri, Mar 18, 2005 at 12:35:17PM -0800, Paul E. McKenney wrote:
> > Compiles, probably dies horribly.  "diff" didn't do such a good job
> > on this one, so attaching the raw rcupdate.[hc] files as well.
> 
> My prediction was all too accurate.  ;-)
> 
> The attached patch at least boots on a 1-CPU x86 box.  I added some
> interrupt disabling that is a bad idea in real-time preempt kernels,
> but necessary for stock kernels to even have a ghost of a chance.
> 
> Again, the diff is quite confusing to read (for me, anyway!), so attached
> the rcupdate.[hc] files.
> 
> Assuming this patch survives the LTP run (hah!!!), next step is a small
> SMP system.

It did survive the UP LTP run, here is an updated patch with a fix needed
for SMP.  Having trouble with test system (independent of this patch),
hopefully will get more testing in over the weekend.

Thanx, Paul
 
Signed-off-by: <[EMAIL PROTECTED]>

diff -urpN -X ../dontdiff linux-2.5/include/linux/rcupdate.h 
linux-2.5-rtRCU/include/linux/rcupdate.h
--- linux-2.5/include/linux/rcupdate.h  Wed Mar  9 12:37:06 2005
+++ linux-2.5-rtRCU/include/linux/rcupdate.h  Fri Mar 18 11:37:02 2005
@@ -58,169 +58,11 @@ struct rcu_head {
(ptr)->next = NULL; (ptr)->func = NULL; \
 } while (0)
 
-
-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
-   long cur;   /* Current batch number.  */
-   long completed; /* Number of the last completed batch */
-   int next_pending;   /* Is the next batch already waiting? */
-} ____cacheline_maxaligned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
-return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
-return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
-   /* 1) quiescent state handling : */
-   long quiescbatch;   /* Batch # for grace period */
-   int passed_quiesc;   /* User-mode/idle loop etc. */
-   int qs_pending;  /* core waits for quiesc state */
-
-   /* 2) batch handling */
-   long batch; /* Batch # for current RCU batch */
-   struct rcu_head *nxtlist;
-   struct rcu_head **nxttail;
-   struct rcu_head *curlist;
-   struct rcu_head **curtail;
-   struct rcu_head *donelist;
-   struct rcu_head **donetail;
-   int cpu;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-extern struct rcu_ctrlblk rcu_ctrlblk;
-extern struct rcu_ctrlblk rcu_bh_ctrlblk;
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
-   struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-   rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
-   struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-   rdp->passed_quiesc = 1;
-}
-
-static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
-   struct rcu_data *rdp)
-{
-   /* This cpu has pending rcu entries and the grace period
-* for them has completed.
-*/
-   if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
-   return 1;
-
-   /* This cpu has no pending entries, but there are new entries */
-   if (!rdp->curlist && rdp->nxtlist)
-   return 1;
-
-   /* This cpu has finished callbacks to invoke */
-   if (rdp->donelist)
-   return 1;
-
-   /* The rcu core waits for a quiescent state from the cpu */
-   if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
-   return 1;
-
-   /* nothing to do */
-   return 0;
-}
-
-static inline int rcu_pending(int cpu)
-{
-   return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
-   __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
-}
-
-/**
- * rcu_read_lock - mark the beginning of an RCU read-side critical section.
- *
- * When synchronize_kernel() is invoked on one CPU while other CPUs
- * are within RCU read-side critical sections, then the
- * synchronize_kernel() is guaranteed to block until after all the other
- * CPUs exit their critical sections.  Similarly, if call_rcu() is invoked
- * on one CPU while other CPUs are within RCU read-side critical
- * sections, invocation of the corresponding RCU callback is deferred
until after all the other CPUs exit 

Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 12:35:17PM -0800, Paul E. McKenney wrote:
> Compiles, probably dies horribly.  "diff" didn't do such a good job
> on this one, so attaching the raw rcupdate.[hc] files as well.

My prediction was all too accurate.  ;-)

The attached patch at least boots on a 1-CPU x86 box.  I added some
interrupt disabling that is a bad idea in real-time preempt kernels,
but necessary for stock kernels to even have a ghost of a chance.

Again, the diff is quite confusing to read (for me, anyway!), so attached
the rcupdate.[hc] files.

Assuming this patch survives the LTP run (hah!!!), next step is a small
SMP system.

Thanx, Paul

Signed-off-by: <[EMAIL PROTECTED]>

diff -urpN -X ../dontdiff linux-2.5/include/linux/rcupdate.h 
linux-2.5-rtRCU/include/linux/rcupdate.h
--- linux-2.5/include/linux/rcupdate.h  Wed Mar  9 12:37:06 2005
+++ linux-2.5-rtRCU/include/linux/rcupdate.h  Fri Mar 18 11:37:02 2005
@@ -58,169 +58,11 @@ struct rcu_head {
(ptr)->next = NULL; (ptr)->func = NULL; \
 } while (0)
 
-
-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
-   long cur;   /* Current batch number.  */
-   long completed; /* Number of the last completed batch */
-   int next_pending;   /* Is the next batch already waiting? */
-} ____cacheline_maxaligned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
-return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
-return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
-   /* 1) quiescent state handling : */
-   long quiescbatch;   /* Batch # for grace period */
-   int passed_quiesc;   /* User-mode/idle loop etc. */
-   int qs_pending;  /* core waits for quiesc state */
-
-   /* 2) batch handling */
-   long batch; /* Batch # for current RCU batch */
-   struct rcu_head *nxtlist;
-   struct rcu_head **nxttail;
-   struct rcu_head *curlist;
-   struct rcu_head **curtail;
-   struct rcu_head *donelist;
-   struct rcu_head **donetail;
-   int cpu;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-extern struct rcu_ctrlblk rcu_ctrlblk;
-extern struct rcu_ctrlblk rcu_bh_ctrlblk;
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
-   struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-   rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
-   struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-   rdp->passed_quiesc = 1;
-}
-
-static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
-   struct rcu_data *rdp)
-{
-   /* This cpu has pending rcu entries and the grace period
-* for them has completed.
-*/
-   if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
-   return 1;
-
-   /* This cpu has no pending entries, but there are new entries */
-   if (!rdp->curlist && rdp->nxtlist)
-   return 1;
-
-   /* This cpu has finished callbacks to invoke */
-   if (rdp->donelist)
-   return 1;
-
-   /* The rcu core waits for a quiescent state from the cpu */
-   if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
-   return 1;
-
-   /* nothing to do */
-   return 0;
-}
-
-static inline int rcu_pending(int cpu)
-{
-   return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
-   __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
-}
-
-/**
- * rcu_read_lock - mark the beginning of an RCU read-side critical section.
- *
- * When synchronize_kernel() is invoked on one CPU while other CPUs
- * are within RCU read-side critical sections, then the
- * synchronize_kernel() is guaranteed to block until after all the other
- * CPUs exit their critical sections.  Similarly, if call_rcu() is invoked
- * on one CPU while other CPUs are within RCU read-side critical
- * sections, invocation of the corresponding RCU callback is deferred
- * until after all the other CPUs exit their critical sections.
- *
- * Note, however, that RCU callbacks are permitted to run concurrently
- * with RCU read-side critical sections.  One way that this can happen
- * is via the following sequence of events: (1) CPU 0 enters an RCU
- * read-side critical section, (2) CPU 1 invokes 

Re: Real-Time Preemption and RCU

2005-03-18 Thread Herbert Xu
Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> i really have no intention to allow multiple readers for rt-mutexes. We
> got away with that so far, and i'd like to keep it so. Imagine 100
> threads all blocked in the same critical section (holding the read-lock)
> when a highprio writer thread comes around: instant 100x latency to let
> all of them roll forward. The only sane solution is to not allow
> excessive concurrency. (That limits SMP scalability, but there's no
> other choice i can see.)

What about allowing only as many concurrent readers as there are CPUs?
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 06:11:26PM +0100, Ingo Molnar wrote:
> 
> * Paul E. McKenney <[EMAIL PROTECTED]> wrote:
> 
> > For the patch, here are my questions:
> > 
> > o   What is the best way to select between classic RCU and this
> > scheme?
> > 
> > 1.  Massive #ifdef across rcupdate.c
> > 
> > 2.  Create an rcupdate_rt.c and browbeat the build system
> > into picking one or the other (no clue if this is
> > possible...)
> > 
> > 3.  Create an rcupdate_rt.c and rely on the linker to pick
> > one or the other, with rcupdate.h generating different
> > external symbol names to make the choice.
> 
> you can also go for option #0: just replace the existing RCU code with
> the new one, and i'll then deal with the configuration details.
> 
> what will have to happen is most likely #2 (since there is near zero
> code sharing between the two variants, right?). Picking rcupdate_rt.c is
> as simple as doing this:
> 
>  obj-$(CONFIG_DONT_PREEMPT_RCU) += rcupdate.o
>  obj-$(CONFIG_PREEMPT_RCU) += rcupdate_rt.o
> 
> and then use Kconfig to generate either CONFIG_DONT_PREEMPT_RCU
> (default) or CONFIG_PREEMPT_RCU (if the user selects it).
> 
> but it's not yet clear whether we want to offer this to users as a
> configurable option. The simplest solution for you would be to go with
> option #0 :-) [or if you prefer switchability, #1 is good too - i can
> then extract the bits and do #2 based on that.]
> 
> > o   How best to interface to OOM?  Left to myself, I leave this
> > for later.  ;-)
> 
> yeah, i'd not worry about OOM that much at this stage.
> 
> > I will take the cowardly approach of patching against the upstream
> > kernel.
> 
> sure. This is in fact easier for me: i'll first rip all my RCU hackery
> out of -RT and then add your code, so the base i'll be merging against
> will be closer to upstream than to current -RT.

Compiles, probably dies horribly.  "diff" didn't do such a good job
on this one, so attaching the raw rcupdate.[hc] files as well.

Thanx, Paul

Signed-off-by: <[EMAIL PROTECTED]>

diff -urpN -X ../dontdiff linux-2.5/include/linux/rcupdate.h 
linux-2.5-rtRCU/include/linux/rcupdate.h
--- linux-2.5/include/linux/rcupdate.h  Wed Mar  9 12:37:06 2005
+++ linux-2.5-rtRCU/include/linux/rcupdate.h  Fri Mar 18 11:37:02 2005
@@ -58,169 +58,11 @@ struct rcu_head {
(ptr)->next = NULL; (ptr)->func = NULL; \
 } while (0)
 
-
-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
-   long cur;   /* Current batch number.  */
-   long completed; /* Number of the last completed batch */
-   int next_pending;   /* Is the next batch already waiting? */
-} ____cacheline_maxaligned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
-return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
-return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
-   /* 1) quiescent state handling : */
-   long quiescbatch;   /* Batch # for grace period */
-   int passed_quiesc;   /* User-mode/idle loop etc. */
-   int qs_pending;  /* core waits for quiesc state */
-
-   /* 2) batch handling */
-   long batch; /* Batch # for current RCU batch */
-   struct rcu_head *nxtlist;
-   struct rcu_head **nxttail;
-   struct rcu_head *curlist;
-   struct rcu_head **curtail;
-   struct rcu_head *donelist;
-   struct rcu_head **donetail;
-   int cpu;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-extern struct rcu_ctrlblk rcu_ctrlblk;
-extern struct rcu_ctrlblk rcu_bh_ctrlblk;
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
-   struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-   rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
-   struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-   rdp->passed_quiesc = 1;
-}
-
-static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
-   struct rcu_data *rdp)
-{
-   /* This cpu has pending rcu entries and the grace period
-* for them has completed.
-*/
-   if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
-   return 1;
-
-   /* This cpu has no pending entries, but 

Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 06:11:26PM +0100, Ingo Molnar wrote:
> 
> * Paul E. McKenney <[EMAIL PROTECTED]> wrote:
> 
> > For the patch, here are my questions:
> > 
> > o   What is the best way to select between classic RCU and this
> > scheme?
> > 
> > 1.  Massive #ifdef across rcupdate.c
> > 
> > 2.  Create an rcupdate_rt.c and browbeat the build system
> > into picking one or the other (no clue if this is
> > possible...)
> > 
> > 3.  Create an rcupdate_rt.c and rely on the linker to pick
> > one or the other, with rcupdate.h generating different
> > external symbol names to make the choice.
> 
> you can also go for option #0: just replace the existing RCU code with
> the new one, and i'll then deal with the configuration details.

Having just spent the past few minutes descending into #ifdef hell,
I agree completely with your option #0.

> what will have to happen is most likely #2 (since there is near zero
> code sharing between the two variants, right?). Picking rcupdate_rt.c is
> as simple as doing this:
> 
>  obj-$(CONFIG_DONT_PREEMPT_RCU) += rcupdate.o
>  obj-$(CONFIG_PREEMPT_RCU) += rcupdate_rt.o
> 
> and then use Kconfig to generate either CONFIG_DONT_PREEMPT_RCU
> (default) or CONFIG_PREEMPT_RCU (if the user selects it).

Cool!  Thank you for the tutorial!  And yes, there is no shared code
that I can see.

> but it's not yet clear whether we want to offer this to users as a
> configurable option. The simplest solution for you would be to go with
> option #0 :-) [or if you prefer switchability, #1 is good too - i can
> then extract the bits and do #2 based on that.]

Option #0 it is -- I will stick with the locking algorithms and let
wiser and more experienced heads work out the configuration issues.

> > o   How best to interface to OOM?  Left to myself, I leave this
> > for later.  ;-)
> 
> yeah, i'd not worry about OOM that much at this stage.
> 
> > I will take the cowardly approach of patching against the upstream
> > kernel.
> 
> sure. This is in fact easier for me: i'll first rip all my RCU hackery
> out of -RT and then add your code, so the base i'll be merging against
> will be closer to upstream than to current -RT.

Sounds good to me!

Thanx, Paul


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Esben Nielsen <[EMAIL PROTECTED]> wrote:

> Why should there be only one RCU reader per CPU at any given
> instance? Even on a real-time UP system it would be very helpful to
> have RCU areas be enterable by several tasks at once. It would
> perform better, both wrt. latencies and throughput: With the above
> implementation a high-priority task entering an RCU area will have to
> boost the current RCU reader, make a task switch until that one
> finishes, and then make yet another task switch to get back to the
> high-priority task. With an RCU implementation which can take n RCU
> readers per CPU there is no such problem.

correct, for RCU we could allow multiple readers per lock, because the
'blocking' side of RCU (callback processing) is never (supposed to be)
in any latency path.

except if someone wants to make RCU callback processing deterministic at
some point. (e.g. for memory management reasons.)

clearly the simplest solution is to go with the single-reader locks for
now - a separate experiment could be done with a new type of rwlock that
can only be used by the RCU code. (I'm not quite sure whether we could
guarantee a minimum rate of RCU callback processing under such a scheme
though. It's an eventual memory DoS otherwise.)

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Paul E. McKenney <[EMAIL PROTECTED]> wrote:

> For the patch, here are my questions:
> 
> o What is the best way to select between classic RCU and this
>   scheme?
> 
>   1.  Massive #ifdef across rcupdate.c
> 
>   2.  Create an rcupdate_rt.c and browbeat the build system
>   into picking one or the other (no clue if this is
>   possible...)
> 
>   3.  Create an rcupdate_rt.c and rely on the linker to pick
>   one or the other, with rcupdate.h generating different
>   external symbol names to make the choice.

you can also go for option #0: just replace the existing RCU code with
the new one, and i'll then deal with the configuration details.

what will have to happen is most likely #2 (since there is near zero
code sharing between the two variants, right?). Picking rcupdate_rt.c is
as simple as doing this:

 obj-$(CONFIG_DONT_PREEMPT_RCU) += rcupdate.o
 obj-$(CONFIG_PREEMPT_RCU) += rcupdate_rt.o

and then use Kconfig to generate either CONFIG_DONT_PREEMPT_RCU
(default) or CONFIG_PREEMPT_RCU (if the user selects it).

but it's not yet clear whether we want to offer this to users as a
configurable option. The simplest solution for you would be to go with
option #0 :-) [or if you prefer switchability, #1 is good too - i can
then extract the bits and do #2 based on that.]

> o How best to interface to OOM?  Left to myself, I leave this
>   for later.  ;-)

yeah, i'd not worry about OOM that much at this stage.

> I will take the cowardly approach of patching against the upstream
> kernel.

sure. This is in fact easier for me: i'll first rip all my RCU hackery
out of -RT and then add your code, so the base i'll be merging against
will be closer to upstream than to current -RT.

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Esben Nielsen

On Fri, 18 Mar 2005, Ingo Molnar wrote:

> 
> * Bill Huey <[EMAIL PROTECTED]> wrote:
> 
> > I'd like to note another problem. Mingo's current implementation of
> > rt_mutex (super mutex for all blocking synchronization) is still
> > missing reader counts and something like that would have to be
> > implemented if you want to do priority inheritance over blocks.
> 
> i really have no intention to allow multiple readers for rt-mutexes. We
> got away with that so far, and i'd like to keep it so. Imagine 100
> threads all blocked in the same critical section (holding the read-lock)
> when a highprio writer thread comes around: instant 100x latency to let
> all of them roll forward. The only sane solution is to not allow
> excessive concurrency. (That limits SMP scalability, but there's no
> other choice i can see.)
>

Unless a design change is made: One could argue for a semantics where
write-locking _isn't_ deterministic and thus does not have to boost all
the readers. Readers boost the writers but not the other way around.
Readers will be deterministic, but not writers.
Such a semantics would probably work for a lot of RT applications that
happen not to take any write-locks - these will in fact perform better.
But it will give the rest a lot of problems.
 
>   Ingo

Esben



Re: Real-Time Preemption and RCU

2005-03-18 Thread Esben Nielsen
On Fri, 18 Mar 2005, Ingo Molnar wrote:

> 
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> 
> > [...] How about something like:
> > 
> > void
> > rcu_read_lock(void)
> > {
> > preempt_disable();
> > if (current->rcu_read_lock_nesting++ == 0) {
> > current->rcu_read_lock_ptr =
> > &__get_cpu_var(rcu_data).lock;
> > preempt_enable();
> > read_lock(current->rcu_read_lock_ptr);
> > } else
> > preempt_enable();
> > }
> > 
> > this would still make it 'statistically scalable' - but is it correct?
> 
> thinking some more about it, i believe it's correct, because it picks
> one particular CPU's lock and correctly releases that lock.
> 
> (read_unlock() is atomic even on PREEMPT_RT, so rcu_read_unlock() is
> fine.)
> 

Why should there be only one RCU reader per CPU at any given
instance? Even on a real-time UP system it would be very helpful to have
RCU areas be enterable by several tasks at once. It would perform
better, both wrt. latencies and throughput:
With the above implementation a high-priority task entering an RCU area
will have to boost the current RCU reader, make a task switch until that
one finishes, and then make yet another task switch to get back to the
high-priority task. With an RCU implementation which can take n RCU
readers per CPU there is no such problem.

Also having all tasks serializing on one lock (per CPU) really destroys
the real-time properties: The latency of anything which uses RCU will be
the worst latency of anything done under the RCU lock.

When I looked briefly at it in the fall, the following solution sprang to
mind: Have an RCU-reader count, rcu_read_count, for each CPU. When you
enter an RCU read region, increment it, and decrement it when you leave.
When it is 0, RCU cleanups are allowed - a perfect quiescent state - so
call rcu_qsctr_inc() at that point. Or call it in schedule() as now, just
with an if (rcu_read_count == 0) around it.

I don't think I understand the current code. But if it works now with
preempt_disable()/preempt_enable() around all the read regions, it ought
to work with
preempt_enable();
rcu_read_count++/--;
preempt_disable()
around the same regions and the above check for rcu_read_count == 0 in or
around rcu_qsctr_inc() as well.

It might take a long time before the rcu-batches are actually called,
though, but that is a different story, which can be improved upon. An
improvement would be to boost the non-RT tasks entering an rcu-read region
to the lowest RT priority. That way there can't be a lot of low-priority
tasks hanging around keeping rcu_read_count non-zero for a long period of
time, since these tasks can only be preempted by RT tasks while in the
RCU region.
>   Ingo

Esben





Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 08:49:03AM +0100, Ingo Molnar wrote:
> 
> * Paul E. McKenney <[EMAIL PROTECTED]> wrote:
> 
> > Seems to me that it would be good to have an RCU implementation that
> > meet the requirements of the Real-Time Preemption patch, but that is
> > 100% compatible with the "classic RCU" API.  Such an implementation
> > must meet a number of requirements, which are listed at the end of
> > this message (search for "REQUIREMENTS").
> 
> [ Wow. you must be a secret telepathic mind-reader - yesterday i was
>   thinking about mailing you, because my approach to RCU preemptability
>   (the API variants) clearly sucked and caused problems all around, both
>   in terms of maintainability and in terms of stability and
>   scalability. ]

Wish I could claim that, but must credit dumb luck.  ;-)

> > 5.  The final version, which both scales and meets realtime
> > requirements, as well as exactly matching the "classic RCU"
> > API.
> > 
> > I have tested this approach, but in user-level scaffolding.  All of
> > these implementations should therefore be regarded with great
> > suspicion: untested, probably don't even compile.  Besides which, I
> > certainly can't claim to fully understand the real-time preempt patch,
> > so I am bound to have gotten something wrong somewhere.  In any case,
> > none of these implementations are a suitable replacement for "classic
> > RCU" on large servers, since they acquire locks in the RCU read-side
> > critical sections. However, they should scale enough to support small
> > SMP systems, inflicting only a modest performance penalty.
> 
> basically for PREEMPT_RT the only constraint is that RCU sections should
> be preemptable. Whatever the performance cost. If PREEMPT_RT is merged
> into the upstream kernel then it will (at least initially) be at a
> status similar to NOMMU: it will be tolerated as long as it causes no
> 'drag' on the main code. The RCU API variants i introduced clearly
> violated this requirement, and were my #1 worry wrt. upstream
> mergability.

Fair enough!

> > I believe that implementation #5 is most appropriate for real-time
> > preempt kernels. [...]
> 
> yeah, agreed - it looks perfect - both the read and write sides are
> preemptable. Can i just plug the code you sent into rcupdate.c and
> expect it to work, or would you like to send a patch? If you prefer you
> can make it an unconditional patch against an upstream kernel to keep
> things simple for you - i'll then massage it to be properly PREEMPT_RT
> dependent.

I will give it a shot, but feel free to grab anything at any point,
starting either from the patch I will send in a bit or from my earlier
email.  Do whatever works best for you.

For the patch, here are my questions:

o   What is the best way to select between classic RCU and this
scheme?

1.  Massive #ifdef across rcupdate.c

2.  Create an rcupdate_rt.c and browbeat the build system
into picking one or the other (no clue if this is
possible...)

3.  Create an rcupdate_rt.c and rely on the linker to pick
one or the other, with rcupdate.h generating different
external symbol names to make the choice.

Left to myself, I would grit my teeth and do #1, since I
understand how to do it.  Flames welcome, as long as accompanied
by good advice!

o   How best to interface to OOM?  Left to myself, I leave this
for later.  ;-)
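For what it's worth, option #3 could look roughly like the following
single-file mock-up, where rcupdate.h maps the classic API names to
per-implementation symbols and the build would link exactly one of
rcupdate.o / rcupdate_rt.o. All names here are invented for illustration:

```c
/* Illustrative mock-up of option #3 (not the actual kernel headers):
 * the header renames the API per configuration; the linker then picks
 * whichever implementation file defines the renamed symbols. */
#include <assert.h>

#define CONFIG_PREEMPT_RT 1          /* pretend -RT was configured */

/* --- what rcupdate.h might do --- */
#ifdef CONFIG_PREEMPT_RT
#define rcu_read_lock    rcu_read_lock_rt
#define rcu_read_unlock  rcu_read_unlock_rt
#endif

/* --- stand-in for the -RT implementation (a hypothetical
 *     rcupdate_rt.c), collapsed into this file for the demo --- */
int rcu_rt_calls;                    /* counts calls into the -RT variant */
void rcu_read_lock_rt(void)   { rcu_rt_calls++; }
void rcu_read_unlock_rt(void) { rcu_rt_calls--; }
```

Callers keep writing rcu_read_lock()/rcu_read_unlock(); the macro layer
decides which symbol they resolve to at compile time.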

I will take the cowardly approach of patching against the upstream kernel.

> > [...] In theory, #3 might be appropriate, but if I understand the
> > real-time preempt implementation of reader-writer lock, it will not
> > perform well if there are long RCU read-side critical sections, even
> > in UP kernels.
> 
> all RCU-locked sections must be preemptable in -RT.  Basically RCU is a
> mainstream API that is used by lots of code and will be introduced in
> many other areas as well. From the -RT kernel's POV this is an
> 'uncontrollable latency source', which keeps introducing critical
> sections. One major goal of PREEMPT_RT is to convert all popular
> critical section APIs into preemptible sections, so that the amount of
> code that is non-preemptable is drastically reduced and can be managed
> (and thus can be trusted). This goal has a higher priority than any
> performance consideration, because it doesnt matter what performance you
> have, if you cannot trust the kernel to be deterministic.

Makes sense to me!

Thanx, Paul


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Bill Huey <[EMAIL PROTECTED]> wrote:

> I'd like to note another problem. Mingo's current implementation of
> rt_mutex (super mutex for all blocking synchronization) is still
> missing reader counts and something like that would have to be
> implemented if you want to do priority inheritance over blocks.

i really have no intention to allow multiple readers for rt-mutexes. We
got away with that so far, and i'd like to keep it so. Imagine 100
threads all blocked in the same critical section (holding the read-lock)
when a highprio writer thread comes around: instant 100x latency to let
all of them roll forward. The only sane solution is to not allow
excessive concurrency. (That limits SMP scalability, but there's no
other choice i can see.)

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Paul E. McKenney <[EMAIL PROTECTED]> wrote:

> > >   preempt_disable();
> > >   if (current->rcu_read_lock_nesting++ == 0) {
> > >           current->rcu_read_lock_ptr =
> > >                   &__get_cpu_var(rcu_data).lock;
> > >           read_lock(current->rcu_read_lock_ptr);
> > >   }
> > >   preempt_enable();

> My current thought is that the preempt_disable()/preempt_enable() can
> be dropped entirely.  Messes up any tool that browses through
> ->rcu_read_lock_nesting, but don't see any other problem.  Yet,
> anyway!

yeah - this sounds good. (We are not aiming for irq-safe RCU anyway, on
PREEMPT_RT.)

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 05:17:29AM -0800, Bill Huey wrote:
> On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> > On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > > 5. Scalability -and- Realtime Response.
> > ...
> > 
> > >   void
> > >   rcu_read_lock(void)
> > >   {
> > >           preempt_disable();
> > >           if (current->rcu_read_lock_nesting++ == 0) {
> > >                   current->rcu_read_lock_ptr =
> > >                           &__get_cpu_var(rcu_data).lock;
> > >                   read_lock(current->rcu_read_lock_ptr);
> > >           }
> > >           preempt_enable();
> > >   }
> > 
> > Ok, here's a rather unsure question...
> > 
> > Uh, is that a sleep violation if that is exclusively held since it
> > can block within an atomic critical section (deadlock) ?
> 
> I'd like to note another problem. Mingo's current implementation of rt_mutex
> (super mutex for all blocking synchronization) is still missing reader counts
> and something like that would have to be implemented if you want to do
> priority inheritance over blocks.
> 
> This is going to throw a wrench into your implementation if you assume that.

If we need to do priority inheritance across the memory allocator, so
that high-priority tasks blocked waiting for memory pass their priority
on to RCU readers, agreed.  But I don't see any sign that real-time
preempt does this.

In absence of this, as Ingo noted, the fact that readers don't block
each other should make things be safe.  I think...

Thanx, Paul


Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > 5. Scalability -and- Realtime Response.
> ...
> 
> > void
> > rcu_read_lock(void)
> > {
> >         preempt_disable();
> >         if (current->rcu_read_lock_nesting++ == 0) {
> >                 current->rcu_read_lock_ptr =
> >                         &__get_cpu_var(rcu_data).lock;
> >                 read_lock(current->rcu_read_lock_ptr);
> >         }
> >         preempt_enable();
> > }
> 
> Ok, here's a rather unsure question...
> 
> Uh, is that a sleep violation if that is exclusively held since it
> can block within an atomic critical section (deadlock) ?

Hey, I wasn't joking when I said that I probably got something wrong!  ;-)

My current thought is that the preempt_disable()/preempt_enable() can
be dropped entirely.  Messes up any tool that browses through
->rcu_read_lock_nesting, but don't see any other problem.  Yet, anyway!

Thanx, Paul


Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 11:03:39AM +0100, Ingo Molnar wrote:
> 
> there's a problem in #5's rcu_read_lock():
> 
> void
> rcu_read_lock(void)
> {
>         preempt_disable();
>         if (current->rcu_read_lock_nesting++ == 0) {
>                 current->rcu_read_lock_ptr =
>                         &__get_cpu_var(rcu_data).lock;
>                 read_lock(current->rcu_read_lock_ptr);
>         }
>         preempt_enable();
> }
> 
> not only are read_lock()-ed sections preemptible, read_lock() itself may
> block, so it cannot be called from within preempt_disable(). How about
> something like:
> 
> void
> rcu_read_lock(void)
> {
>         preempt_disable();
>         if (current->rcu_read_lock_nesting++ == 0) {
>                 current->rcu_read_lock_ptr =
>                         &__get_cpu_var(rcu_data).lock;
>                 preempt_enable();
>                 read_lock(current->rcu_read_lock_ptr);
>         } else
>                 preempt_enable();
> }
> 
> this would still make it 'statistically scalable' - but is it correct?

Good catch!

Also good question...

Strictly speaking, it is not necessary to block callback invocation until
just after rcu_read_lock() returns.

It is correct as long as there is no sort of "upcall" or "callback" that
can masquerade as this task.  I know of no such thing in the Linux kernel.
In fact such a thing would break a lot of code, right?

Any tool that relied on the ->rcu_read_lock_nesting counter to deduce
RCU state would be confused by this change, but there might be other
ways of handling this.  Also, we are currently making do without such
a tool.

It should be possible to move the preempt_enable() further forward
ahead of the assignment to ->rcu_read_lock_ptr, since the assignment
to ->rcu_read_lock_ptr is strictly local.  Not sure that this is
worthwhile, thoughts?

void
rcu_read_lock(void)
{
        preempt_disable();
        if (current->rcu_read_lock_nesting++ == 0) {
                preempt_enable();
                current->rcu_read_lock_ptr =
                        &__get_cpu_var(rcu_data).lock;
                read_lock(current->rcu_read_lock_ptr);
        } else
                preempt_enable();
}

The other question is whether preempt_disable() is needed in the first
place.  The two task-structure fields are not accessed except by the
task itself.  I bet that the following is just fine:

void
rcu_read_lock(void)
{
        if (current->rcu_read_lock_nesting++ == 0) {
                current->rcu_read_lock_ptr =
                        &__get_cpu_var(rcu_data).lock;
                read_lock(current->rcu_read_lock_ptr);
        }
}
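The variants above can be exercised in user-level scaffolding along the
following lines. The toy lock, the per-task fields modeled as globals,
and the synchronize helper are all illustrative stand-ins, not kernel
code; the single-threaded model "reports" rather than blocks:

```c
/* User-level model of scheme #5: each "CPU" has a reader-writer lock;
 * a task records which lock it took, so the matching one is released
 * even if the task later "migrates" to another CPU. */
#include <assert.h>

#define NR_CPUS 4

struct rwlock { int readers; };          /* toy lock: just a reader count */
static struct rwlock rcu_lock[NR_CPUS];

static int this_cpu;                     /* smp_processor_id() stand-in */
static int rcu_read_lock_nesting;        /* models task-structure fields */
static struct rwlock *rcu_read_lock_ptr;

void rcu_read_lock(void)
{
        if (rcu_read_lock_nesting++ == 0) {
                rcu_read_lock_ptr = &rcu_lock[this_cpu]; /* note the CPU's lock */
                rcu_read_lock_ptr->readers++;
        }
}

void rcu_read_unlock(void)
{
        if (--rcu_read_lock_nesting == 0)
                rcu_read_lock_ptr->readers--;   /* release the lock we took */
}

/* Grace period: write-acquire every CPU's lock in turn.  In this
 * single-threaded model we cannot block, so report whether we would. */
int synchronize_kernel_model(void)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                if (rcu_lock[cpu].readers != 0)
                        return -1;              /* would block on a reader */
        return 0;
}
```

A grace period can then be seen to wait for the outermost
rcu_read_unlock(), while nested lock/unlock pairs are free.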

Thoughts?

Thanx, Paul


Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 10:53:27AM +0100, Ingo Molnar wrote:
> 
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> 
> > there's one detail on PREEMPT_RT though (which i think you noticed too). 
> > 
> > Priority inheritance handling can be done in a pretty straightforward
> > way as long as no true read-side nesting is allowed for rwsems and
> > rwlocks - i.e. there's only one owner of a lock at a time. So
> > PREEMPT_RT restricts rwsem and rwlock concurrency: readers are
> > writers, with the only exception that they are allowed to 'self-nest'.
> > [...]
> 
> this does not affect read-side RCU, because read-side RCU can never
> block a higher-prio thread. (as long as callback processing is pushed
> into a separate kernel thread.)
> 
> so RCU will be pretty much the only mechanism (besides lock-free code)
> that allows reader concurrency on PREEMPT_RT.

This is a relief!  I was wondering how on earth I was going to solve
the multi-task priority-inheritance problem!

But...  How do we handle the following scenario?

0.  A bunch of low-priority threads are preempted in the
middle of various RCU read-side critical sections.

1.  High-priority thread does kmalloc(), but there is no
memory, so it blocks.

2.  OOM handling notices, and decides to clean up the outstanding
RCU callbacks.  It therefore invokes _synchronize_kernel()
(in implementation #5).

3.  The _synchronize_kernel() function tries to acquire all of
the read-side locks, which are held by numerous preempted
low-priority threads.

4.  What now???

Or does the current patch do priority inheritance across the memory
allocator?  In other words, can we get away with ignoring this one?  ;-)

Thanx, Paul


Re: Real-Time Preemption and RCU

2005-03-18 Thread hui
On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > 5. Scalability -and- Realtime Response.
> ...
> 
> > void
> > rcu_read_lock(void)
> > {
> >         preempt_disable();
> >         if (current->rcu_read_lock_nesting++ == 0) {
> >                 current->rcu_read_lock_ptr =
> >                         &__get_cpu_var(rcu_data).lock;
> >                 read_lock(current->rcu_read_lock_ptr);
> >         }
> >         preempt_enable();
> > }
> 
> Ok, here's a rather unsure question...
> 
> Uh, is that a sleep violation if that is exclusively held since it
> can block within an atomic critical section (deadlock) ?

I'd like to note another problem. Mingo's current implementation of rt_mutex
(super mutex for all blocking synchronization) is still missing reader counts
and something like that would have to be implemented if you want to do priority
inheritance over blocks.

This is going to throw a wrench into your implementation if you assume that.

bill




Re: Real-Time Preemption and RCU

2005-03-18 Thread hui
On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> 5. Scalability -and- Realtime Response.
...

>   void
>   rcu_read_lock(void)
>   {
>           preempt_disable();
>           if (current->rcu_read_lock_nesting++ == 0) {
>                   current->rcu_read_lock_ptr =
>                           &__get_cpu_var(rcu_data).lock;
>                   read_lock(current->rcu_read_lock_ptr);
>           }
>           preempt_enable();
>   }

Ok, here's a rather unsure question...

Uh, is that a sleep violation if that is exclusively held since it
can block within an atomic critical section (deadlock) ?

bill




Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Paul E. McKenney <[EMAIL PROTECTED]> wrote:

> 5. Scalability -and- Realtime Response.
> 
> The trick here is to keep track of the CPU that did the
> rcu_read_lock() in the task structure.  If there is a preemption,
> there will be some slight inefficiency, but the correct lock will be
> released, preserving correctness.

the inefficiency will be larger, because (as explained in a previous
mail) multiple concurrent owners of the read-lock are not allowed. This
adds to the overhead of PREEMPT_RT on SMP, but is an intentional
tradeoff. I dont expect PREEMPT_RT to be used on an Altix :-|

#5 is still the best option, because in the normal 'infrequent
preemption' case it will behave in a cache-friendly way. A positive
effect of the lock serializing is that the callback backlog will stay
relatively small and that the RCU backlog processing can be made
deterministic as well (if needed), by making the backlog processing
thread(s) SCHED_FIFO.

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> [...] How about something like:
> 
> void
> rcu_read_lock(void)
> {
>         preempt_disable();
>         if (current->rcu_read_lock_nesting++ == 0) {
>                 current->rcu_read_lock_ptr =
>                         &__get_cpu_var(rcu_data).lock;
>                 preempt_enable();
>                 read_lock(current->rcu_read_lock_ptr);
>         } else
>                 preempt_enable();
> }
> 
> this would still make it 'statistically scalable' - but is it correct?

thinking some more about it, i believe it's correct, because it picks
one particular CPU's lock and correctly releases that lock.

(read_unlock() is atomic even on PREEMPT_RT, so rcu_read_unlock() is
fine.)

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

there's a problem in #5's rcu_read_lock():

void
rcu_read_lock(void)
{
        preempt_disable();
        if (current->rcu_read_lock_nesting++ == 0) {
                current->rcu_read_lock_ptr =
                        &__get_cpu_var(rcu_data).lock;
                read_lock(current->rcu_read_lock_ptr);
        }
        preempt_enable();
}

not only are read_lock()-ed sections preemptible, read_lock() itself may
block, so it cannot be called from within preempt_disable(). How about
something like:

void
rcu_read_lock(void)
{
        preempt_disable();
        if (current->rcu_read_lock_nesting++ == 0) {
                current->rcu_read_lock_ptr =
                        &__get_cpu_var(rcu_data).lock;
                preempt_enable();
                read_lock(current->rcu_read_lock_ptr);
        } else
                preempt_enable();
}

this would still make it 'statistically scalable' - but is it correct?

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> there's one detail on PREEMPT_RT though (which i think you noticed too). 
> 
> Priority inheritance handling can be done in a pretty straightforward
> way as long as no true read-side nesting is allowed for rwsems and
> rwlocks - i.e. there's only one owner of a lock at a time. So
> PREEMPT_RT restricts rwsem and rwlock concurrency: readers are
> writers, with the only exception that they are allowed to 'self-nest'.
> [...]

this does not affect read-side RCU, because read-side RCU can never
block a higher-prio thread. (as long as callback processing is pushed
into a separate kernel thread.)

so RCU will be pretty much the only mechanism (besides lock-free code)
that allows reader concurrency on PREEMPT_RT.

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> * Paul E. McKenney <[EMAIL PROTECTED]> wrote:
> 
> > 4. Preemptible read side.
> > 
> > RCU read-side critical sections can (in theory, anyway) be quite
> > large, degrading realtime scheduling response.  Preemptible RCU
> > read-side critical sections avoid such degradation.  Manual
> > insertion of "preemption points" might be an alternative as well.
> > But I have no intention of trying to settle the long-running
> > argument between proponents of preemption and of preemption
> > points.  Not today, anyway!  ;-)
> 
> i'm cleverly sidestepping that argument by offering 4 preemption
> models in the -RT tree :-) We dont have to pick a winner, users will.
> [...]

also, it turned out that "preemption points" vs. "forced preemption" are
not exclusive concepts: PREEMPT_RT relies on _both_ of them.

when a highprio task tries to acquire a lock that another, lower-prio
task already holds, then 'the time it takes for the lowprio task to drop
the lock' directly depends on the frequency of explicit preemption
points within the critical section. So to get good 'lock dependent'
latencies on PREEMPT_RT we also need a good distribution of preemption
points within locked sections.

(obviously, 'lock independent' preemption latencies depend purely on
the quality of forced preemption and the size of non-preempt critical
sections, not on any explicit preemption point.)

dont we love seemingly conflicting concepts that end up helping each
other? It's a nice flamewar-killer ;-)

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> 
> * Paul E. McKenney <[EMAIL PROTECTED]> wrote:
> 
> > I have tested this approach, but in user-level scaffolding.  All of
> > these implementations should therefore be regarded with great
> > suspicion: untested, probably don't even compile.  Besides which, I
> > certainly can't claim to fully understand the real-time preempt patch,
> > so I am bound to have gotten something wrong somewhere. [...]
> 
> you dont even have to consider the -RT patchset: if the scheme allows
> forced preemption of read-side RCU sections on current upstream
> CONFIG_PREEMPT, then it's perfect for PREEMPT_RT too.

there's one detail on PREEMPT_RT though (which i think you noticed too). 

Priority inheritance handling can be done in a pretty straightforward
way as long as no true read-side nesting is allowed for rwsems and
rwlocks - i.e. there's only one owner of a lock at a time. So PREEMPT_RT
restricts rwsem and rwlock concurrency: readers are writers, with the
only exception that they are allowed to 'self-nest'. I.e. things like:

read_lock();
...
read_lock();

are still legal. (it's also done quite often.)

(it is virtually impossible to implement priority inheritance for true
multi-reader locks in any sane way: i've done it initially and it sucks
very much. It also fundamentally increases the 'lock-dependent'
worst-case latencies - imagine 4 readers having to finish first if a
higher-prio writer comes along. It's insane.)
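That single-owner-with-self-nesting rule can be sketched as a toy model.
The struct, function names, and the non-blocking return convention below
are all illustrative, not the actual rt_mutex code:

```c
/* Toy model of PREEMPT_RT's rwlock semantics: readers are treated as
 * writers (one owner at a time), except that the owner may re-acquire
 * the lock ("self-nesting"). */
#include <assert.h>
#include <stddef.h>

struct rt_rwlock {
        const void *owner;   /* the one task holding the lock, or NULL */
        int depth;           /* self-nesting depth */
};

/* returns 0 on success, -1 where a real lock would have to block */
int rt_read_lock(struct rt_rwlock *l, const void *task)
{
        if (l->owner == NULL) { l->owner = task; l->depth = 1; return 0; }
        if (l->owner == task) { l->depth++; return 0; }  /* self-nest ok */
        return -1;           /* a second reader is excluded, like a writer */
}

void rt_read_unlock(struct rt_rwlock *l, const void *task)
{
        assert(l->owner == task);
        if (--l->depth == 0)
                l->owner = NULL;
}
```

With only one owner to track, priority inheritance stays as simple as
for a plain mutex: boost exactly that owner.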

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Paul E. McKenney <[EMAIL PROTECTED]> wrote:

> I have tested this approach, but in user-level scaffolding.  All of
> these implementations should therefore be regarded with great
> suspicion: untested, probably don't even compile.  Besides which, I
> certainly can't claim to fully understand the real-time preempt patch,
> so I am bound to have gotten something wrong somewhere. [...]

you dont even have to consider the -RT patchset: if the scheme allows
forced preemption of read-side RCU sections on current upstream
CONFIG_PREEMPT, then it's perfect for PREEMPT_RT too.

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Paul E. McKenney <[EMAIL PROTECTED]> wrote:

> 4. Preemptible read side.
> 
>   RCU read-side critical sections can (in theory, anyway) be quite
>   large, degrading realtime scheduling response.  Preemptible RCU
>   read-side critical sections avoid such degradation.  Manual
>   insertion of "preemption points" might be an alternative as well.
>   But I have no intention of trying to settle the long-running
>   argument between proponents of preemption and of preemption
>   points.  Not today, anyway!  ;-)

i'm cleverly sidestepping that argument by offering 4 preemption models
in the -RT tree :-) We dont have to pick a winner, users will. The way
low latencies are achieved depends on the preemption model:

   ( ) No Forced Preemption (Server)
   ( ) Voluntary Kernel Preemption (Desktop)
   ( ) Preemptible Kernel (Low-Latency Desktop)
   (X) Complete Preemption (Real-Time)

"Server" is the current default !PREEMPT model. Best throughput, bad
worst-case latencies.

"Low-Latency Desktop" is the current PREEMPT model. Has some runtime
overhead relative to Server, offers fair worst-case latencies.

"Desktop" is a new mode that is somewhere between Server and Low-Latency
Desktop: it's what was initially called 'voluntary preemption'. It
doesnt make the kernel forcibly preemptible anywhere, but inserts a fair
number of preemption points to decrease latencies statistically, while
keeping the runtime overhead close to the "Server". (a variant of this
model is utilized by Fedora and RHEL4 currently, so if this gets
upstream i expect distros to pick it up - it can be a migration helper
towards the "Low-Latency Desktop" model.)

"Real-Time" is the no-compromises hard-RT model: almost everything but
the scheduler itself is fully preemptible. Phenomenal worst-case
latencies in every workload scenario, but also has the highest runtime
overhead.

preemptable RCU makes sense for the "Low-Latency Desktop" and
"Real-Time" preemption models - these are the ones that do forced
preemption.

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Paul E. McKenney <[EMAIL PROTECTED]> wrote:

>   [I believe that the real-time preemption patch moves code
>   from softirq/interrupt to process context, but could easily be
>   missing something.  If need be, there are ways of handling cases
>   were realtime RCU is called from softirq and interrupt contexts,
>   for an example approach, see the Quick Quiz answers.]

correct, on PREEMPT_RT, IRQ processing is (and must be) in process
context.

(it's pretty much a must-have thing to enable the preemption of irq
handlers and softirq processing: if irq/softirq contexts nested on the
kernel stack then irq contexts would 'pin down' the process context that
was interrupted, so even if irq preemption was possible, the underlying
process context couldnt be freely scheduled.)

still, it's nice that your #5 design does not rely on it - it's a sign
of fundamental robustness, plus we could (in theory) offer preemptable
RCU on PREEMPT_DESKTOP (== current PREEMPT) as well - just like there is
PREEMPT_BKL.

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Paul E. McKenney [EMAIL PROTECTED] wrote:

   [I believe that the real-time preemption patch moves code
   from softirq/interrupt to process context, but could easily be
   missing something.  If need be, there are ways of handling cases
   were realtime RCU is called from softirq and interrupt contexts,
   for an example approach, see the Quick Quiz answers.]

correct, on PREEMPT_RT, IRQ processing is (and must be) in process
context.

(it's pretty much a must-have thing to enable the preemption of irq
handlers and softirq processing: if irq/softirq contexts nested on the
kernel stack then irq contexts would 'pin down' the process context that
was interrupted, so even if irq preemption was possible, the underlying
process context couldnt be freely scheduled.)

still, it's nice that your #5 design does not rely on it - it's a sign
of fundamental robustness, plus we could (in theory) offer preemptable
RCU on PREEMPT_DESKTOP (== current PREEMPT) as well - just like there is
PREEMPT_BKL.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Paul E. McKenney [EMAIL PROTECTED] wrote:

 4. Preemptible read side.
 
   RCU read-side critical sections can (in theory, anyway) be quite
   large, degrading realtime scheduling response.  Preemptible RCU
   read-side critical sections avoid such degradation.  Manual
   insertion of preemption points might be an alternative as well.
   But I have no intention of trying to settle the long-running
   argument between proponents of preemption and of preemption
   points.  Not today, anyway!  ;-)

i'm cleverly sidestepping that argument by offering 4 preemption models
in the -RT tree :-) We dont have to pick a winner, users will. The way
low latencies are achieved depends on the preemption model:

   ( ) No Forced Preemption (Server)
   ( ) Voluntary Kernel Preemption (Desktop)
   ( ) Preemptible Kernel (Low-Latency Desktop)
   (X) Complete Preemption (Real-Time)

Server is the current default !PREEMPT model. Best throughput, bad
worst-case latencies.

Low-Latency Desktop is the current PREEMPT model. Has some runtime
overhead relative to Server, offers fair worst-case latencies.

Desktop is a new mode that is somewhere between Server and Low-Latency
Desktop: it's what was initially called 'voluntary preemption'. It
doesnt make the kernel forcibly preemptible anywhere, but inserts a fair
number of preemption points to decrease latencies statistically, while
keeping the runtime overhead close to the Server. (a variant of this
model is utilized by Fedora and RHEL4 currently, so if this gets
upstream i expect distros to pick it up - it can be a migration helper
towards the Low-Latency Desktop model.)

Real-Time is the no-compromises hard-RT model: almost everything but
the scheduler itself is fully preemptible. Phenomenal worst-case
latencies in every workload scenario, but also has the highest runtime
overhead.

preemptable RCU makes sense for the Low-Latency Desktop and
Real-Time preemption models - these are the ones that do forced
preemption.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Paul E. McKenney [EMAIL PROTECTED] wrote:

 I have tested this approach, but in user-level scaffolding.  All of
 these implementations should therefore be regarded with great
 suspicion: untested, probably don't even compile.  Besides which, I
 certainly can't claim to fully understand the real-time preempt patch,
 so I am bound to have gotten something wrong somewhere. [...]

you dont even have to consider the -RT patchset: if the scheme allows
forced preemption of read-side RCU sections on current upstream
CONFIG_PREEMPT, then it's perfect for PREEMPT_RT too.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Ingo Molnar [EMAIL PROTECTED] wrote:

 
 * Paul E. McKenney [EMAIL PROTECTED] wrote:
 
  I have tested this approach, but in user-level scaffolding.  All of
  these implementations should therefore be regarded with great
  suspicion: untested, probably don't even compile.  Besides which, I
  certainly can't claim to fully understand the real-time preempt patch,
  so I am bound to have gotten something wrong somewhere. [...]
 
 you dont even have to consider the -RT patchset: if the scheme allows
 forced preemption of read-side RCU sections on current upstream
 CONFIG_PREEMPT, then it's perfect for PREEMPT_RT too.

there's one detail on PREEMPT_RT though (which i think you noticed too). 

Priority inheritance handling can be done in a pretty straightforward
way as long as no true read-side nesting is allowed for rwsems and
rwlocks - i.e. there's only one owner of a lock at a time. So PREEMPT_RT
restricts rwsem and rwlock concurrency: readers are treated as writers,
with the only exception that they are allowed to 'self-nest'. I.e. things like:

read_lock(&rwlock);
...
read_lock(&rwlock);

are still legal. (it's also done quite often.)

(it is virtually impossible to implement priority inheritance for true
multi-reader locks in any sane way: i've done it initially and it sucks
very much. It also fundamentally increases the 'lock-dependent'
worst-case latencies - imagine 4 readers having to finish first if a
higher-prio writer comes along. It's insane.)

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Ingo Molnar [EMAIL PROTECTED] wrote:

 * Paul E. McKenney [EMAIL PROTECTED] wrote:
 
  4. Preemptible read side.
  
  RCU read-side critical sections can (in theory, anyway) be quite
  large, degrading realtime scheduling response.  Preemptible RCU
  read-side critical sections avoid such degradation.  Manual
  insertion of preemption points might be an alternative as well.
  But I have no intention of trying to settle the long-running
  argument between proponents of preemption and of preemption
  points.  Not today, anyway!  ;-)
 
 i'm cleverly sidestepping that argument by offering 4 preemption
 models in the -RT tree :-) We dont have to pick a winner, users will.
 [...]

also, it turned out that preemption points vs. forced preemption are
not exclusive concepts: PREEMPT_RT relies on _both_ of them.

when a highprio task tries to acquire a lock that another, lower-prio
task already holds, then 'the time it takes for the lowprio task to drop
the lock' directly depends on the frequency of explicit preemption
points within the critical section. So to get good 'lock dependent'
latencies on PREEMPT_RT we need both forced preemption and a good
distribution of explicit preemption points within locked sections.

(obviously, 'lock independent' preemption latencies depend purely on
the quality of forced preemption and the size of non-preemptible
critical sections, not on any explicit preemption point.)

dont we love seemingly conflicting concepts that end up helping each
other? It's a nice flamewar-killer ;-)

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Ingo Molnar [EMAIL PROTECTED] wrote:

 there's one detail on PREEMPT_RT though (which i think you noticed too). 
 
 Priority inheritance handling can be done in a pretty straightforward
 way as long as no true read-side nesting is allowed for rwsems and
 rwlocks - i.e. there's only one owner of a lock at a time. So
 PREEMPT_RT restricts rwsem and rwlock concurrency: readers are treated
 as writers, with the only exception that they are allowed to 'self-nest'.
 [...]

this does not affect read-side RCU, because read-side RCU can never
block a higher-prio thread. (as long as callback processing is pushed
into a separate kernel thread.)

so RCU will be pretty much the only mechanism (besides lock-free code)
that allows reader concurrency on PREEMPT_RT.

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

there's a problem in #5's rcu_read_lock():

void
rcu_read_lock(void)
{
        preempt_disable();
        if (current->rcu_read_lock_nesting++ == 0) {
                current->rcu_read_lock_ptr =
                        &__get_cpu_var(rcu_data).lock;
                read_lock(current->rcu_read_lock_ptr);
        }
        preempt_enable();
}

not only are read_lock()-ed sections preemptible, read_lock() itself may
block, so it cannot be called from within preempt_disable(). How about
something like:

void
rcu_read_lock(void)
{
        preempt_disable();
        if (current->rcu_read_lock_nesting++ == 0) {
                current->rcu_read_lock_ptr =
                        &__get_cpu_var(rcu_data).lock;
                preempt_enable();
                read_lock(current->rcu_read_lock_ptr);
        } else
                preempt_enable();
}

this would still make it 'statistically scalable' - but is it correct?

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Ingo Molnar [EMAIL PROTECTED] wrote:

 [...] How about something like:
 
 void
 rcu_read_lock(void)
 {
         preempt_disable();
         if (current->rcu_read_lock_nesting++ == 0) {
                 current->rcu_read_lock_ptr =
                         &__get_cpu_var(rcu_data).lock;
                 preempt_enable();
                 read_lock(current->rcu_read_lock_ptr);
         } else
                 preempt_enable();
 }
 
 this would still make it 'statistically scalable' - but is it correct?

thinking some more about it, i believe it's correct, because it picks
one particular CPU's lock and correctly releases that lock.

(read_unlock() is atomic even on PREEMPT_RT, so rcu_read_unlock() is
fine.)

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Paul E. McKenney [EMAIL PROTECTED] wrote:

 5. Scalability -and- Realtime Response.
 
 The trick here is to keep track of the CPU that did the
 rcu_read_lock() in the task structure.  If there is a preemption,
 there will be some slight inefficiency, but the correct lock will be
 released, preserving correctness.

the inefficiency will be larger, because (as explained in a previous
mail) multiple concurrent owners of the read-lock are not allowed. This
adds to the overhead of PREEMPT_RT on SMP, but is an intentional
tradeoff. I dont expect PREEMPT_RT to be used on an Altix :-|

#5 is still the best option, because in the normal 'infrequent
preemption' case it will behave in a cache-friendly way. A positive
effect of the lock serializing is that the callback backlog will stay
relatively small and that the RCU backlog processing can be made
deterministic as well (if needed), by making the backlog processing
thread(s) SCHED_FIFO.

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread hui
On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
 5. Scalability -and- Realtime Response.
...

   void
   rcu_read_lock(void)
   {
           preempt_disable();
           if (current->rcu_read_lock_nesting++ == 0) {
                   current->rcu_read_lock_ptr =
                           &__get_cpu_var(rcu_data).lock;
                   read_lock(current->rcu_read_lock_ptr);
           }
           preempt_enable();
   }

Ok, here's a rather unsure question...

Uh, is that a sleep violation if that is exclusively held since it
can block within an atomic critical section (deadlock) ?

bill




Re: Real-Time Preemption and RCU

2005-03-18 Thread hui
On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
 On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
  5. Scalability -and- Realtime Response.
 ...
 
  void
  rcu_read_lock(void)
  {
          preempt_disable();
          if (current->rcu_read_lock_nesting++ == 0) {
                  current->rcu_read_lock_ptr =
                          &__get_cpu_var(rcu_data).lock;
                  read_lock(current->rcu_read_lock_ptr);
          }
          preempt_enable();
  }
 
 Ok, here's a rather unsure question...
 
 Uh, is that a sleep violation if that is exclusively held since it
 can block within an atomic critical section (deadlock) ?

I'd like to note another problem. Mingo's current implementation of rt_mutex
(super mutex for all blocking synchronization) is still missing reader counts
and something like that would have to be implemented if you want to do priority
inheritance over blocks.

This is going to throw a wrench into your implementation if you assume that.

bill




Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 10:53:27AM +0100, Ingo Molnar wrote:
 
 * Ingo Molnar [EMAIL PROTECTED] wrote:
 
  there's one detail on PREEMPT_RT though (which i think you noticed too). 
  
  Priority inheritance handling can be done in a pretty straightforward
  way as long as no true read-side nesting is allowed for rwsems and
  rwlocks - i.e. there's only one owner of a lock at a time. So
  PREEMPT_RT restricts rwsem and rwlock concurrency: readers are treated
  as writers, with the only exception that they are allowed to 'self-nest'.
  [...]
 
 this does not affect read-side RCU, because read-side RCU can never
 block a higher-prio thread. (as long as callback processing is pushed
 into a separate kernel thread.)
 
 so RCU will be pretty much the only mechanism (besides lock-free code)
 that allows reader concurrency on PREEMPT_RT.

This is a relief!  I was wondering how on earth I was going to solve
the multi-task priority-inheritance problem!

But...  How do we handle the following scenario?

0.  A bunch of low-priority threads are preempted in the
middle of various RCU read-side critical sections.

1.  High-priority thread does kmalloc(), but there is no
memory, so it blocks.

2.  OOM handling notices, and decides to clean up the outstanding
RCU callbacks.  It therefore invokes _synchronize_kernel()
(in implementation #5).

3.  The _synchronize_kernel() function tries to acquire all of
the read-side locks, which are held by numerous preempted
low-priority threads.

4.  What now???

Or does the current patch do priority inheritance across the memory
allocator?  In other words, can we get away with ignoring this one?  ;-)

Thanx, Paul


Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 11:03:39AM +0100, Ingo Molnar wrote:
 
 there's a problem in #5's rcu_read_lock():
 
 void
 rcu_read_lock(void)
 {
         preempt_disable();
         if (current->rcu_read_lock_nesting++ == 0) {
                 current->rcu_read_lock_ptr =
                         &__get_cpu_var(rcu_data).lock;
                 read_lock(current->rcu_read_lock_ptr);
         }
         preempt_enable();
 }
 
 not only are read_lock()-ed sections preemptible, read_lock() itself may
 block, so it cannot be called from within preempt_disable(). How about
 something like:
 
 void
 rcu_read_lock(void)
 {
         preempt_disable();
         if (current->rcu_read_lock_nesting++ == 0) {
                 current->rcu_read_lock_ptr =
                         &__get_cpu_var(rcu_data).lock;
                 preempt_enable();
                 read_lock(current->rcu_read_lock_ptr);
         } else
                 preempt_enable();
 }
 
 this would still make it 'statistically scalable' - but is it correct?

Good catch!

Also good question...

Strictly speaking, it is not necessary to block callback invocation until
just after rcu_read_lock() returns.

It is correct as long as there is no sort of upcall or callback that
can masquerade as this task.  I know of no such thing in the Linux kernel.
In fact such a thing would break a lot of code, right?

Any tool that relied on the ->rcu_read_lock_nesting counter to deduce
RCU state would be confused by this change, but there might be other
ways of handling this.  Also, we are currently making do without such
a tool.

It should be possible to move the preempt_enable() further forward
ahead of the assignment to ->rcu_read_lock_ptr, since the assignment
to ->rcu_read_lock_ptr is strictly local.  Not sure that this is
worthwhile, thoughts?

void
rcu_read_lock(void)
{
        preempt_disable();
        if (current->rcu_read_lock_nesting++ == 0) {
                preempt_enable();
                current->rcu_read_lock_ptr =
                        &__get_cpu_var(rcu_data).lock;
                read_lock(current->rcu_read_lock_ptr);
        } else
                preempt_enable();
}

The other question is whether preempt_disable() is needed in the first
place.  The two task-structure fields are not accessed except by the
task itself.  I bet that the following is just fine:

void
rcu_read_lock(void)
{
        if (current->rcu_read_lock_nesting++ == 0) {
                current->rcu_read_lock_ptr =
                        &__get_cpu_var(rcu_data).lock;
                read_lock(current->rcu_read_lock_ptr);
        }
}

Thoughts?

Thanx, Paul


Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
 On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
  5. Scalability -and- Realtime Response.
 ...
 
  void
  rcu_read_lock(void)
  {
          preempt_disable();
          if (current->rcu_read_lock_nesting++ == 0) {
                  current->rcu_read_lock_ptr =
                          &__get_cpu_var(rcu_data).lock;
                  read_lock(current->rcu_read_lock_ptr);
          }
          preempt_enable();
  }
 
 Ok, here's a rather unsure question...
 
 Uh, is that a sleep violation if that is exclusively held since it
 can block within an atomic critical section (deadlock) ?

Hey, I wasn't joking when I said that I probably got something wrong!  ;-)

My current thought is that the preempt_disable()/preempt_enable() can
be dropped entirely.  Messes up any tool that browses through
->rcu_read_lock_nesting, but don't see any other problem.  Yet, anyway!

Thanx, Paul


Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 05:17:29AM -0800, Bill Huey wrote:
 On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
  On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
   5. Scalability -and- Realtime Response.
  ...
  
 void
 rcu_read_lock(void)
 {
         preempt_disable();
         if (current->rcu_read_lock_nesting++ == 0) {
                 current->rcu_read_lock_ptr =
                         &__get_cpu_var(rcu_data).lock;
                 read_lock(current->rcu_read_lock_ptr);
         }
         preempt_enable();
 }
  
  Ok, here's a rather unsure question...
  
  Uh, is that a sleep violation if that is exclusively held since it
  can block within an atomic critical section (deadlock) ?
 
 I'd like to note another problem. Mingo's current implementation of rt_mutex
 (super mutex for all blocking synchronization) is still missing reader counts
  and something like that would have to be implemented if you want to do
  priority inheritance over blocks.
 
 This is going to throw a wrench into your implementation if you assume that.

If we need to do priority inheritance across the memory allocator, so
that high-priority tasks blocked waiting for memory pass their priority
on to RCU readers, agreed.  But I don't see any sign that real-time
preempt does this.

In absence of this, as Ingo noted, the fact that readers don't block
each other should make things be safe.  I think...

Thanx, Paul


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Paul E. McKenney [EMAIL PROTECTED] wrote:

 preempt_disable();
 if (current->rcu_read_lock_nesting++ == 0) {
         current->rcu_read_lock_ptr =
                 &__get_cpu_var(rcu_data).lock;
         read_lock(current->rcu_read_lock_ptr);
 }
 preempt_enable();

 My current thought is that the preempt_disable()/preempt_enable() can
 be dropped entirely.  Messes up any tool that browses through
 ->rcu_read_lock_nesting, but don't see any other problem.  Yet,
 anyway!

yeah - this sounds good. (We are not aiming for irq-safe RCU anyway, on
PREEMPT_RT.)

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Ingo Molnar

* Bill Huey [EMAIL PROTECTED] wrote:

 I'd like to note another problem. Mingo's current implementation of
 rt_mutex (super mutex for all blocking synchronization) is still
 missing reader counts and something like that would have to be
 implemented if you want to do priority inheritance over blocks.

i really have no intention to allow multiple readers for rt-mutexes. We
got away with that so far, and i'd like to keep it so. Imagine 100
threads all blocked in the same critical section (holding the read-lock)
when a highprio writer thread comes around: instant 100x latency to let
all of them roll forward. The only sane solution is to not allow
excessive concurrency. (That limits SMP scalability, but there's no
other choice i can see.)

Ingo


Re: Real-Time Preemption and RCU

2005-03-18 Thread Paul E. McKenney
On Fri, Mar 18, 2005 at 08:49:03AM +0100, Ingo Molnar wrote:
 
 * Paul E. McKenney [EMAIL PROTECTED] wrote:
 
  Seems to me that it would be good to have an RCU implementation that
  meet the requirements of the Real-Time Preemption patch, but that is
  100% compatible with the classic RCU API.  Such an implementation
  must meet a number of requirements, which are listed at the end of
  this message (search for REQUIREMENTS).
 
 [ Wow. you must be a secret telepathic mind-reader - yesterday i was
   thinking about mailing you, because my approach to RCU preemptability
   (the API variants) clearly sucked and caused problems all around, both
   in terms of maintainability and in terms of stability and
   scalability. ]

Wish I could claim that, but must credit dumb luck.  ;-)

  5.  The final version, which both scales and meets realtime
  requirements, as well as exactly matching the classic RCU
  API.
  
  I have tested this approach, but in user-level scaffolding.  All of
  these implementations should therefore be regarded with great
  suspicion: untested, probably don't even compile.  Besides which, I
  certainly can't claim to fully understand the real-time preempt patch,
  so I am bound to have gotten something wrong somewhere.  In any case,
  none of these implementations are a suitable replacement for classic
  RCU on large servers, since they acquire locks in the RCU read-side
  critical sections. However, they should scale enough to support small
  SMP systems, inflicting only a modest performance penalty.
 
 basically for PREEMPT_RT the only constraint is that RCU sections should
 be preemptable. Whatever the performance cost. If PREEMPT_RT is merged
 into the upstream kernel then it will (at least initially) be at a
 status similar to NOMMU: it will be tolerated as long as it causes no
 'drag' on the main code. The RCU API variants i introduced clearly
 violated this requirement, and were my #1 worry wrt. upstream
 mergability.

Fair enough!

  I believe that implementation #5 is most appropriate for real-time
  preempt kernels. [...]
 
 yeah, agreed - it looks perfect - both the read and write side is
 preemptable. Can i just plug the code you sent into rcupdate.c and
 expect it to work, or would you like to send a patch? If you prefer you
 can make it an unconditional patch against an upstream kernel to keep
 things simple for you - i'll then massage it to be properly PREEMPT_RT
 dependent.

I will give it a shot, but feel free to grab anything at any point,
starting either from the patch I will send in a bit or from my earlier
email.  Do whatever works best for you.

For the patch, here are my questions:

o   What is the best way to select between classic RCU and this
scheme?

1.  Massive #ifdef across rcupdate.c

2.  Create an rcupdate_rt.c and browbeat the build system
into picking one or the other (no clue if this is
possible...)

3.  Create an rcupdate_rt.c and rely on the linker to pick
one or the other, with rcupdate.h generating different
external symbol names to make the choice.

Left to myself, I would grit my teeth and do #1, since I
understand how to do it.  Flames welcome, as long as accompanied
by good advice!

o   How best to interface to OOM?  Left to myself, I leave this
for later.  ;-)
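For options 2 and 3, the build selection might be a hypothetical
Makefile fragment along these lines (the config symbol name is a guess,
not from the patch):

```make
# Hypothetical: link exactly one RCU implementation, chosen at config time.
ifdef CONFIG_PREEMPT_RT
obj-y += rcupdate_rt.o
else
obj-y += rcupdate.o
endif
```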

I will take the cowardly approach of patching against the upstream kernel.

  [...] In theory, #3 might be appropriate, but if I understand the
  real-time preempt implementation of reader-writer lock, it will not
  perform well if there are long RCU read-side critical sections, even
  in UP kernels.
 
 all RCU-locked sections must be preemptable in -RT.  Basically RCU is a
 mainstream API that is used by lots of code and will be introduced in
 many other areas as well. From the -RT kernel's POV this is an
 'uncontrollable latency source', which keeps introducing critical
 sections. One major goal of PREEMPT_RT is to convert all popular
 critical section APIs into preemptible sections, so that the amount of
 code that is non-preemptable is drastically reduced and can be managed
 (and thus can be trusted). This goal has a higher priority than any
 performance consideration, because it doesnt matter what performance you
 have, if you cannot trust the kernel to be deterministic.

Makes sense to me!

Thanx, Paul

