Re: [patch] del_timer_sync scalability patch

2005-03-20 Thread Andrew Morton
Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> On Sun, 20 Mar 2005, Andrew Morton wrote:
> 
> > > Hope Andrew is going to take the patch this time.
> >
> > Hope Kenneth is going to test the alternate del_timer_sync patches in next
> > -mm ;)
> 
> BTW, why are we going through this? Oleg posted a much better solution
> to this issue yesterday, AFAIK.

That is what I was referring to.

Those patches need to be reviewed, performance tested and stability tested.
It's appropriate that interested parties do that work, please.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] del_timer_sync scalability patch

2005-03-20 Thread Andrew Morton
Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> On Sun, 20 Mar 2005, Andrew Morton wrote:
> 
> > "Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
> > >
> > > We did exactly the same thing about 10 months back.  Nice to
> > > see that independent people came up with exactly the same
> > > solution that we proposed 10 months back.
> >
> > Well the same question applies.  Christoph, which code is calling
> > del_timer_sync() so often that you noticed?
> 
> Ummm. I have to ask those who brought this to my attention. Are you
> looking for the application under which del_timer_sync showed up in
> profiling or the kernel subsystem?

I was wondering which part of the kernel was hammering del_timer_sync() so
hard.  I guess we could work that out from a description of the workload.



Re: [patch] del_timer_sync scalability patch

2005-03-20 Thread Christoph Lameter
On Sun, 20 Mar 2005, Andrew Morton wrote:

> > Hope Andrew is going to take the patch this time.
>
> Hope Kenneth is going to test the alternate del_timer_sync patches in next
> -mm ;)

BTW, why are we going through this? Oleg posted a much better solution
to this issue yesterday, AFAIK.


Re: [patch] del_timer_sync scalability patch

2005-03-20 Thread Christoph Lameter
On Sun, 20 Mar 2005, Andrew Morton wrote:

> "Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
> >
> > We did exactly the same thing about 10 months back.  Nice to
> > see that independent people came up with exactly the same
> > solution that we proposed 10 months back.
>
> Well the same question applies.  Christoph, which code is calling
> del_timer_sync() so often that you noticed?

Ummm. I have to ask those who brought this to my attention. Are you
looking for the application under which del_timer_sync showed up in
profiling or the kernel subsystem?



Re: [patch] del_timer_sync scalability patch

2005-03-20 Thread Andrew Morton
"Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
>
> We did exactly the same thing about 10 months back.  Nice to
> see that independent people came up with exactly the same
> solution that we proposed 10 months back.

Well the same question applies.  Christoph, which code is calling
del_timer_sync() so often that you noticed?

>  In fact, this patch
> is line-by-line identical to the one we post.

I assume that's not a coincidence.

> Hope Andrew is going to take the patch this time.

Hope Kenneth is going to test the alternate del_timer_sync patches in next
-mm ;)



RE: [patch] del_timer_sync scalability patch

2005-03-20 Thread Chen, Kenneth W
We did exactly the same thing about 10 months back.  Nice to
see that independent people came up with exactly the same
solution that we proposed 10 months back.  In fact, this patch
is line-by-line identical to the one we post.

Hope Andrew is going to take the patch this time.

See our original posting:
http://marc.theaimsgroup.com/?l=linux-kernel&m=108422767319822&w=2

- Ken


Re: [patch] del_timer_sync scalability patch

2005-03-15 Thread Ingo Molnar

* Christoph Lameter <[EMAIL PROTECTED]> wrote:

> The following patch removes the magic in the timer_list structure
> (Andrew suggested that we may not need it anymore) and replaces it
> with two u8 variables that give us some additional state of the timer

The 'remove the magic' observation is not a 'backdoor' to introduce new
fields 'for free'. So please don't mix the two things into one patch.

> +   u8 running; /* function is currently executing */
> +   u8 shutdown;/* do not schedule this timer */

It may well be cleaner to do the timer->base_running thing after all,
and to do this in __run_timers():

timer->base = NULL;
timer->base_running = base;

note that we cannot clear ->base_running in __run_timers() [we don't own
any access to the timer anymore, it may be kfree()d, etc.], so
del_timer_sync() has to make sure that the timer->base_running info is
not stale - it is a hint only. But it looks doable, and it should solve
the NUMA scalability problem.

Ingo


Re: [patch] del_timer_sync scalability patch

2005-03-15 Thread Oleg Nesterov
Christoph Lameter wrote:
>
> @@ -476,6 +454,7 @@ repeat:
>   }
>   }
>   spin_lock_irq(&base->lock);
> + timer->running = 0;
^^
>   goto repeat;
>   }
>   }

This is IMHO wrong. The timer probably doesn't exist anymore when
timer_list->function returns.

I mean, timer_list->function could deallocate timer's memory.

Oleg.


Re: [patch] del_timer_sync scalability patch

2005-03-15 Thread Oleg Nesterov
Christoph Lameter wrote:
>
> On Sun, 13 Mar 2005, Oleg Nesterov wrote:
>
> > I suspect that del_timer_sync() in its current form is racy.
> >
...snip...
> > next timer interrupt, __run_timers() picks
> > this timer again, sets timer->base = NULL
  ^^^
> >
> > if (timer_pending(timer))   
> > // no, timer->base == NULL
>
> timer->base is != NULL because the timer has rescheduled itself.
> __mod_timer sets timer->base.

Christoph, please look again. Yes, __mod_timer sets timer->base,
but it is cleared in the _next_ timer interrupt on CPU 0.

Andrew, Ingo, what do you think?

Oleg.


Re: [patch] del_timer_sync scalability patch

2005-03-15 Thread Christoph Lameter
How about this take on the problem?

When a potentially periodic timer is deleted through del_timer_sync, all
CPUs are scanned to determine whether the timer is running on that CPU. In
a NUMA configuration, doing so causes NUMA interlink traffic, which limits
the scalability of timers.

The following patch removes the magic field from the timer_list structure
(Andrew suggested that we may not need it anymore) and replaces it with two
u8 variables that give us some additional timer state:

running  -> Set if the timer function is currently executing
shutdown -> Set if del_timer_sync is running and the timer
            should not be rescheduled.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.11/include/linux/timer.h
===
--- linux-2.6.11.orig/include/linux/timer.h	2005-03-14 14:17:51.671533904 -0800
+++ linux-2.6.11/include/linux/timer.h  2005-03-14 14:41:49.640929328 -0800
@@ -13,22 +13,22 @@ struct timer_list {
unsigned long expires;

spinlock_t lock;
-   unsigned long magic;

void (*function)(unsigned long);
unsigned long data;

-   struct tvec_t_base_s *base;
+   struct tvec_t_base_s *base; /* ==NULL if timer not scheduled */
+   u8 running; /* function is currently executing */
+   u8 shutdown;/* do not schedule this timer */
 };

-#define TIMER_MAGIC	0x4b87ad6e
-
 #define TIMER_INITIALIZER(_function, _expires, _data) {\
.function = (_function),\
.expires = (_expires),  \
.data = (_data),\
.base = NULL,   \
-   .magic = TIMER_MAGIC,   \
+   .running = 0,   \
+   .shutdown = 0,  \
.lock = SPIN_LOCK_UNLOCKED, \
}

@@ -42,7 +42,8 @@ struct timer_list {
 static inline void init_timer(struct timer_list * timer)
 {
timer->base = NULL;
-   timer->magic = TIMER_MAGIC;
+   timer->running = 0;
+   timer->shutdown = 0;
	spin_lock_init(&timer->lock);
 }

Index: linux-2.6.11/kernel/timer.c
===
--- linux-2.6.11.orig/kernel/timer.c	2005-03-14 14:17:51.671533904 -0800
+++ linux-2.6.11/kernel/timer.c 2005-03-14 14:57:24.613791848 -0800
@@ -89,31 +89,6 @@ static inline void set_running_timer(tve
 /* Fake initialization */
 static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED };

-static void check_timer_failed(struct timer_list *timer)
-{
-   static int whine_count;
-   if (whine_count < 16) {
-   whine_count++;
-   printk("Uninitialised timer!\n");
-   printk("This is just a warning.  Your computer is OK\n");
-   printk("function=0x%p, data=0x%lx\n",
-   timer->function, timer->data);
-   dump_stack();
-   }
-   /*
-* Now fix it up
-*/
-   spin_lock_init(>lock);
-   timer->magic = TIMER_MAGIC;
-}
-
-static inline void check_timer(struct timer_list *timer)
-{
-   if (timer->magic != TIMER_MAGIC)
-   check_timer_failed(timer);
-}
-
-
 static void internal_add_timer(tvec_base_t *base, struct timer_list *timer)
 {
unsigned long expires = timer->expires;
@@ -164,8 +139,6 @@ int __mod_timer(struct timer_list *timer

BUG_ON(!timer->function);

-   check_timer(timer);
-
	spin_lock_irqsave(&timer->lock, flags);
new_base = &__get_cpu_var(tvec_bases);
 repeat:
@@ -208,8 +181,10 @@ repeat:
ret = 1;
}
timer->expires = expires;
-   internal_add_timer(new_base, timer);
-   timer->base = new_base;
+   if (!timer->shutdown) {
+   internal_add_timer(new_base, timer);
+   timer->base = new_base;
+   }

if (old_base && (new_base != old_base))
		spin_unlock(&old_base->lock);
@@ -235,7 +210,8 @@ void add_timer_on(struct timer_list *tim

BUG_ON(timer_pending(timer) || !timer->function);

-   check_timer(timer);
+   if (timer->shutdown)
+   return;

	spin_lock_irqsave(&base->lock, flags);
internal_add_timer(base, timer);
@@ -267,8 +243,6 @@ int mod_timer(struct timer_list *timer,
 {
BUG_ON(!timer->function);

-   check_timer(timer);
-
/*
 * This is a common optimization triggered by the
 * networking code - if the timer is re-modified
@@ -298,8 +272,6 @@ int del_timer(struct timer_list *timer)
unsigned long flags;
tvec_base_t *base;

-   check_timer(timer);
-
 repeat:
base = timer->base;
	if (!base)
@@ -337,35 



Re: [patch] del_timer_sync scalability patch

2005-03-14 Thread Christoph Lameter

On Sun, 13 Mar 2005, Oleg Nesterov wrote:

> I suspect that del_timer_sync() in its current form is racy.
>
> CPU 0 CPU 1
>
> __run_timers() sets timer->base = NULL
>
>   del_timer_sync() starts, calls 
> del_timer(), it returns
>   because timer->base == NULL.
>
>   waits until the run is complete:
>   while 
> (base->running_timer == timer)
>   
> preempt_check_resched();
>
>   calls schedule(), or long 
> interrupt happens.
>
> timer reschedules itself, __run_timers()
> exits.
>
>   base->running_timer == NULL, 
> end of loop.
>
> next timer interrupt, __run_timers() picks
> this timer again, sets timer->base = NULL
>
>   if (timer_pending(timer))   
> // no, timer->base == NULL

timer->base is != NULL because the timer has rescheduled itself.
__mod_timer sets timer->base.

>   goto del_again; 
> // not taken

Yes it is taken!

>
>   del_timer_sync() returns
>
> timer runs.

Nope.


Re: [patch] del_timer_sync scalability patch

2005-03-13 Thread Oleg Nesterov
I suspect that del_timer_sync() in its current form is racy.

CPU 0   CPU 1

__run_timers() sets timer->base = NULL

del_timer_sync() starts, calls 
del_timer(), it returns
because timer->base == NULL.

waits until the run is complete:
while 
(base->running_timer == timer)

preempt_check_resched();

calls schedule(), or long 
interrupt happens.

timer reschedules itself, __run_timers()
exits.

base->running_timer == NULL, 
end of loop.

next timer interrupt, __run_timers() picks
this timer again, sets timer->base = NULL

if (timer_pending(timer))   
// no, timer->base == NULL
goto del_again; 
// not taken

del_timer_sync() returns

timer runs.

No?

Oleg.


Re: [patch] del_timer_sync scalability patch

2005-03-11 Thread Christoph Lameter
On Fri, 11 Mar 2005, Oleg Nesterov wrote:

> I think it is not enough to exchange these 2 lines in
> __run_timers, we also need barriers.

Maybe it's best to drop last_running_timer as Ingo suggested.

Replace the magic with a flag that can be set to stop scheduling a timer
again.

Then del_timer_sync may:

1. Set the flag so the timer cannot be rescheduled
2. Delete the timer
3. Wait until the timer function is complete
4. (Eventually verify that the timer is really gone)
5. Clear the no-reschedule flag



Re: [patch] del_timer_sync scalability patch

2005-03-11 Thread Oleg Nesterov
Hello.

I am not sure, but I think this patch is incorrect.

> @@ -466,6 +482,7 @@ repeat:
>   set_running_timer(base, timer);
>   smp_wmb();
>   timer->base = NULL;
--> WINDOW <--
> + set_last_running(timer, base);
>   spin_unlock_irq(&base->lock);

Suppose it is the first invocation of timer, so
timer->last_running == NULL.

What if del_timer_sync() happens in that window?

del_timer_sync:
del_timer();// timer->base == NULL, returns

base = timer->last_running;
if (base)   // no, it is still NULL
...

if (timer->base != NULL || timer->last_running != base)
goto del_again; // not taken

return;

I think it is not enough to exchange these 2 lines in
__run_timers, we also need barriers.

Oleg.


Re: [patch] del_timer_sync scalability patch

2005-03-08 Thread Christoph Lameter
On Tue, 8 Mar 2005, Ingo Molnar wrote:

> > >  The following patch makes the timer remember where the timer was last
> > >  started. It is then possible to only wait for the completion of the timer
> > >  on that specific cpu.
>
> i'm not sure about this. The patch adds one more pointer to a very
> frequently used and frequently embedded data structure (struct
> timer_list), for the benefit of a rarely used API variant
> (timer_del_sync()).

Without that pointer there is no information in the timer_list struct as
to where the timer is/was running since base is set to NULL when the timer
function is executed.

> Furthermore, timer->base itself is affine too, a timer always runs on
> the CPU belonging to timer->base. So a more scalable variant of
> del_timer_sync() would perhaps be possible by carefully deleting the
> timer and only going into the full loop if the timer was deleted before.
> (and in which case semantics require us to synchronize on timer
> execution.) Or we could skip the full loop altogether and just
> synchronize with timer execution if _we_ deleted the timer. This should
> work fine as far as itimers are concerned.

There is no easy way to distinguish the case that a timer was deleted
from the case that the timer function is running (which may reschedule
the periodic timer). A scan over all the bases is required to ensure that
the timer function is not being executed right now.

If we only would sync when we deleted the timer (meaning timer->base was
!= NULL) then we would not synchronize when the timer function
is executing (and therefore timer->base == NULL). However, we need to sync
exactly in that case because a periodic timer may set timer->base again
when scheduling the next event. That is the main stated purpose of
del_timer_sync.
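For reference, the all-bases scan being discussed can be modeled in userspace like this (a sketch only — model_base, running_timer and NCPUS are stand-ins, not the real 2.6 structures):

```c
#include <assert.h>
#include <stddef.h>

#define NCPUS 4	/* stand-in for the online-cpu count */

/* Each per-CPU timer base records which timer it is currently
 * executing; NULL means none. */
struct model_base {
	const void *running_timer;
};

static struct model_base model_bases[NCPUS];

/* Model of the sync step: every base must be inspected, because the
 * timer's own base pointer is NULL while its function runs, so only
 * the per-CPU running_timer slots can tell us where it is.  Returns
 * how many bases were touched -- always all of them. */
static int model_sync_scan(const void *timer)
{
	int scanned = 0;

	for (int cpu = 0; cpu < NCPUS; cpu++) {
		scanned++;
		while (model_bases[cpu].running_timer == timer)
			/* in the kernel this spins until the handler
			 * finishes; here we just clear the slot */
			model_bases[cpu].running_timer = NULL;
	}
	return scanned;
}
```

On a large NUMA machine those per-CPU reads hit remote memory, which is exactly the interlink traffic the last_running patch tries to avoid.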


Re: [patch] del_timer_sync scalability patch

2005-03-08 Thread Christoph Lameter
On Tue, 8 Mar 2005, Andrew Morton wrote:

> If we're prepared to rule that a timer handler is not allowed to do
> add_timer_on() then a recurring timer is permanently pinned to a CPU, isn't
> it?

The process may be rescheduled to run on a different processor. Then the
add_timer() function (called from schedule_next_timer in
kernel/posix-timers.c) will also add the timer to the new processor
because it is called from the signal handling code. So I think that it
is possible that a periodic timer will be scheduled on different
processors.


Re: [patch] del_timer_sync scalability patch

2005-03-08 Thread Andrew Morton
Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> 
> * Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> > Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > >
> > > When a potential periodic timer is deleted through timer_del_sync, all
> > >  cpus are scanned to determine if the timer is running on that cpu. In a
> > >  NUMA configuration doing so will cause NUMA interlink traffic which 
> > > limits
> > >  the scalability of timers.
> > > 
> > >  The following patch makes the timer remember where the timer was last
> > >  started. It is then possible to only wait for the completion of the timer
> > >  on that specific cpu.
> 
> i'm not sure about this. The patch adds one more pointer to a very
> frequently used and frequently embedded data structure (struct
> timer_list), for the benefit of a rarely used API variant
> (timer_del_sync()).

True.

(We can delete timer.magic and check_timer() now.  It has served its purpose)

> Furthermore, timer->base itself is affine too, a timer always runs on
> the CPU belonging to timer->base. So a more scalable variant of
> del_timer_sync() would perhaps be possible by carefully deleting the
> timer and only going into the full loop if the timer was deleted before. 
> (and in which case semantics require us to synchronize on timer
> execution.) Or we could skip the full loop altogether and just
> synchronize with timer execution if _we_ deleted the timer. This should
> work fine as far as itimers are concerned.

If we're prepared to rule that a timer handler is not allowed to do
add_timer_on() then a recurring timer is permanently pinned to a CPU, isn't
it?

That should make things simpler?


Re: [patch] del_timer_sync scalability patch

2005-03-08 Thread Ingo Molnar

* Andrew Morton <[EMAIL PROTECTED]> wrote:

> Christoph Lameter <[EMAIL PROTECTED]> wrote:
> >
> > When a potential periodic timer is deleted through timer_del_sync, all
> >  cpus are scanned to determine if the timer is running on that cpu. In a
> >  NUMA configuration doing so will cause NUMA interlink traffic which limits
> >  the scalability of timers.
> > 
> >  The following patch makes the timer remember where the timer was last
> >  started. It is then possible to only wait for the completion of the timer
> >  on that specific cpu.

i'm not sure about this. The patch adds one more pointer to a very
frequently used and frequently embedded data structure (struct
timer_list), for the benefit of a rarely used API variant
(timer_del_sync()).

Furthermore, timer->base itself is affine too, a timer always runs on
the CPU belonging to timer->base. So a more scalable variant of
del_timer_sync() would perhaps be possible by carefully deleting the
timer and only going into the full loop if the timer was deleted before. 
(and in which case semantics require us to synchronize on timer
execution.) Or we could skip the full loop altogether and just
synchronize with timer execution if _we_ deleted the timer. This should
work fine as far as itimers are concerned.

Ingo




Re: [patch] del_timer_sync scalability patch

2005-03-07 Thread Andrew Morton
Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> When a potential periodic timer is deleted through timer_del_sync, all
>  cpus are scanned to determine if the timer is running on that cpu. In a
>  NUMA configuration doing so will cause NUMA interlink traffic which limits
>  the scalability of timers.
> 
>  The following patch makes the timer remember where the timer was last
>  started. It is then possible to only wait for the completion of the timer
>  on that specific cpu.

OK, I stared at this for a while and cannot see holes in it.  There may
still be one though - this code is tricky.  Probably any such holes would
be covered up by the fact that timers never migrate between CPUs anyway,
unless the handler chooses to do add_timer_on(), which I'm sure none do..

Ingo, could you take a look?  Especially what happens if the timer hops
CPUs after the 

/* Get where the timer ran last */
base = timer->last_running;

statement.  Do we have sufficient barriers in there for this?


A few tidyings which I'd suggest:




- Remove TIMER_INIT_LASTRUNNING: struct initialisers set unmentioned fields
  to zero anyway.

- Rename set_last_running() to timer_set_last_running, move it to timer.h.
  Then use it in init_timer() to avoid an ifdef.

Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 25-akpm/include/linux/timer.h |   21 ++---
 25-akpm/kernel/timer.c        |   10 +-
 2 files changed, 11 insertions(+), 20 deletions(-)

diff -puN include/linux/timer.h~del_timer_sync-scalability-patch-tidy include/linux/timer.h
--- 25/include/linux/timer.h~del_timer_sync-scalability-patch-tidy	2005-03-07 23:28:55.000000000 -0800
+++ 25-akpm/include/linux/timer.h	2005-03-07 23:30:23.000000000 -0800
@@ -26,12 +26,6 @@ struct timer_list {
 
 #define TIMER_MAGIC0x4b87ad6e
 
-#ifdef CONFIG_SMP
-#define TIMER_INIT_LASTRUNNING .last_running = NULL,
-#else
-#define TIMER_INIT_LASTRUNNING
-#endif
-
 #define TIMER_INITIALIZER(_function, _expires, _data) {\
.function = (_function),\
.expires = (_expires),  \
@@ -39,9 +33,16 @@ struct timer_list {
.base = NULL,   \
.magic = TIMER_MAGIC,   \
.lock = SPIN_LOCK_UNLOCKED, \
-   TIMER_INIT_LASTRUNNING  \
}
 
+static inline void
+timer_set_last_running(struct timer_list *timer, struct tvec_t_base_s *base)
+{
+#ifdef CONFIG_SMP
+   timer->last_running = base;
+#endif
+}
+
 /***
  * init_timer - initialize a timer.
  * @timer: the timer to be initialized
@@ -49,11 +50,9 @@ struct timer_list {
  * init_timer() must be done to a timer prior calling *any* of the
  * other timer functions.
  */
-static inline void init_timer(struct timer_list * timer)
+static inline void init_timer(struct timer_list *timer)
 {
-#ifdef CONFIG_SMP
-   timer->last_running = NULL;
-#endif
+   timer_set_last_running(timer, NULL);
timer->base = NULL;
timer->magic = TIMER_MAGIC;
	spin_lock_init(&timer->lock);
diff -puN kernel/timer.c~del_timer_sync-scalability-patch-tidy kernel/timer.c
--- 25/kernel/timer.c~del_timer_sync-scalability-patch-tidy	2005-03-07 23:28:55.000000000 -0800
+++ 25-akpm/kernel/timer.c	2005-03-07 23:28:55.000000000 -0800
@@ -86,14 +86,6 @@ static inline void set_running_timer(tve
 #endif
 }
 
-static inline void set_last_running(struct timer_list *timer,
-   tvec_base_t *base)
-{
-#ifdef CONFIG_SMP
-   timer->last_running = base;
-#endif
-}
-
 /* Fake initialization */
 static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED };
 
@@ -482,7 +474,7 @@ repeat:
set_running_timer(base, timer);
smp_wmb();
timer->base = NULL;
-   set_last_running(timer, base);
+   timer_set_last_running(timer, base);
 	spin_unlock_irq(&base->lock);
{
u32 preempt_count = preempt_count();
_


