Re: [patch] del_timer_sync scalability patch
Christoph Lameter <[EMAIL PROTECTED]> wrote: > > On Sun, 20 Mar 2005, Andrew Morton wrote: > > > > Hope Andrew is going to take the patch this time. > > > > Hope Kenneth is going to test the alternate del_timer_sync patches in next > > -mm ;) > > BTW Why are we going through this? Oleg has posted a much better solution > to this issue yersteday AFAIK. That is what I was referring to. Those patches need to be reviewed, performance tested and stability tested. It's appropriate that interested parties do that work, please. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Christoph Lameter <[EMAIL PROTECTED]> wrote: > > On Sun, 20 Mar 2005, Andrew Morton wrote: > > > "Chen, Kenneth W" <[EMAIL PROTECTED]> wrote: > > > > > > We did exactly the same thing about 10 months back. Nice to > > > see that independent people came up with exactly the same > > > solution that we proposed 10 months back. > > > > Well the same question applies. Christoph, which code is calling > > del_timer_sync() so often that you noticed? > > Ummm. I have to ask those who brought this to my attention. Are you > looking for the application under which del_timer_sync showed up in > profiling or the kernel subsystem? I was wondering which part of the kernel was hammering del_timer_sync() so hard. I guess we could work that out from a description of the workload. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
On Sun, 20 Mar 2005, Andrew Morton wrote: > > Hope Andrew is going to take the patch this time. > > Hope Kenneth is going to test the alternate del_timer_sync patches in next > -mm ;) BTW Why are we going through this? Oleg has posted a much better solution to this issue yersteday AFAIK. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
On Sun, 20 Mar 2005, Andrew Morton wrote: > "Chen, Kenneth W" <[EMAIL PROTECTED]> wrote: > > > > We did exactly the same thing about 10 months back. Nice to > > see that independent people came up with exactly the same > > solution that we proposed 10 months back. > > Well the same question applies. Christoph, which code is calling > del_timer_sync() so often that you noticed? Ummm. I have to ask those who brought this to my attention. Are you looking for the application under which del_timer_sync showed up in profiling or the kernel subsystem? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
"Chen, Kenneth W" <[EMAIL PROTECTED]> wrote: > > We did exactly the same thing about 10 months back. Nice to > see that independent people came up with exactly the same > solution that we proposed 10 months back. Well the same question applies. Christoph, which code is calling del_timer_sync() so often that you noticed? > In fact, this patch > is line-by-line identical to the one we post. I assume that's not a coincidence. > Hope Andrew is going to take the patch this time. Hope Kenneth is going to test the alternate del_timer_sync patches in next -mm ;) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [patch] del_timer_sync scalability patch
We did exactly the same thing about 10 months back. Nice to see that independent people came up with exactly the same solution that we proposed 10 months back. In fact, this patch is line-by-line identical to the one we post. Hope Andrew is going to take the patch this time. See our original posting: http://marc.theaimsgroup.com/?l=linux-kernel=108422767319822=2 - Ken - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [patch] del_timer_sync scalability patch
We did exactly the same thing about 10 months back. Nice to see that independent people came up with exactly the same solution that we proposed 10 months back. In fact, this patch is line-by-line identical to the one we post. Hope Andrew is going to take the patch this time. See our original posting: http://marc.theaimsgroup.com/?l=linux-kernelm=108422767319822w=2 - Ken - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Chen, Kenneth W [EMAIL PROTECTED] wrote: We did exactly the same thing about 10 months back. Nice to see that independent people came up with exactly the same solution that we proposed 10 months back. Well the same question applies. Christoph, which code is calling del_timer_sync() so often that you noticed? In fact, this patch is line-by-line identical to the one we post. I assume that's not a coincidence. Hope Andrew is going to take the patch this time. Hope Kenneth is going to test the alternate del_timer_sync patches in next -mm ;) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
On Sun, 20 Mar 2005, Andrew Morton wrote: Chen, Kenneth W [EMAIL PROTECTED] wrote: We did exactly the same thing about 10 months back. Nice to see that independent people came up with exactly the same solution that we proposed 10 months back. Well the same question applies. Christoph, which code is calling del_timer_sync() so often that you noticed? Ummm. I have to ask those who brought this to my attention. Are you looking for the application under which del_timer_sync showed up in profiling or the kernel subsystem? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
On Sun, 20 Mar 2005, Andrew Morton wrote: Hope Andrew is going to take the patch this time. Hope Kenneth is going to test the alternate del_timer_sync patches in next -mm ;) BTW Why are we going through this? Oleg has posted a much better solution to this issue yersteday AFAIK. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Christoph Lameter [EMAIL PROTECTED] wrote: On Sun, 20 Mar 2005, Andrew Morton wrote: Chen, Kenneth W [EMAIL PROTECTED] wrote: We did exactly the same thing about 10 months back. Nice to see that independent people came up with exactly the same solution that we proposed 10 months back. Well the same question applies. Christoph, which code is calling del_timer_sync() so often that you noticed? Ummm. I have to ask those who brought this to my attention. Are you looking for the application under which del_timer_sync showed up in profiling or the kernel subsystem? I was wondering which part of the kernel was hammering del_timer_sync() so hard. I guess we could work that out from a description of the workload. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Christoph Lameter [EMAIL PROTECTED] wrote: On Sun, 20 Mar 2005, Andrew Morton wrote: Hope Andrew is going to take the patch this time. Hope Kenneth is going to test the alternate del_timer_sync patches in next -mm ;) BTW Why are we going through this? Oleg has posted a much better solution to this issue yersteday AFAIK. That is what I was referring to. Those patches need to be reviewed, performance tested and stability tested. It's appropriate that interested parties do that work, please. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
* Christoph Lameter <[EMAIL PROTECTED]> wrote: > The following patch removes the magic in the timer_list structure > (Andrew suggested that we may not need it anymore) and replaces it > with two u8 variables that give us some additional state of the timer The 'remove the magic' observation is not a 'backdoor' to introduce new fields 'for free'. So please dont mix the two things into one patch. > + u8 running; /* function is currently executing */ > + u8 shutdown;/* do not schedule this timer */ it may as well be cleaner to do the timer->base_running thing after all, and to do this in __run_timers(): timer->base = NULL; timer->base_running = base; note that we cannot clear ->base_running in __run_timers() [we dont own any access to the timer anymore, it may be kfree()d, etc.], so del_timer_sync() has to make sure that the timer->base_running info is not stale - it is a hint only. But it looks doable, and it should solve the NUMA scalability problem. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Christoph Lameter wrote: > > @@ -476,6 +454,7 @@ repeat: > } > } > spin_lock_irq(>lock); > + timer->running = 0; ^^ > goto repeat; > } > } This is imho wrong. The timer probably don't exist when timer_list->function returns. I mean, timer_list->function could deallocate timer's memory. Oleg. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Christoph Lameter wrote: > > On Sun, 13 Mar 2005, Oleg Nesterov wrote: > > > I suspect that del_timer_sync() in its current form is racy. > > ...snip... > > next timer interrupt, __run_timers() picks > > this timer again, sets timer->base = NULL ^^^ > > > > if (timer_pending(timer)) > > // no, timer->base == NULL > > timer->base is != NULL because the timer has rescheduled itself. > __mod_timer sets timer->bBase Christoph, please look again. Yes, __mod_timer sets timer->base, but it is cleared in the _next_ timer interrupt on CPU 0. Andrew, Ingo, what do you think? Oleg. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
How about this take on the problem? When a potential periodic timer is deleted through timer_del_sync, all cpus are scanned to determine if the timer is running on that cpu. In a NUMA configuration doing so will cause NUMA interlink traffic which limits the scalability of timers. The following patch removes the magic in the timer_list structure (Andrew suggested that we may not need it anymore) and replaces it with two u8 variables that give us some additional state of the timer running -> Set if the timer function is currently executing shutdown-> Set if del_timer_sync is running and the timer should not be rescheduled. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.11/include/linux/timer.h === --- linux-2.6.11.orig/include/linux/timer.h 2005-03-14 14:17:51.671533904 -0800 +++ linux-2.6.11/include/linux/timer.h 2005-03-14 14:41:49.640929328 -0800 @@ -13,22 +13,22 @@ struct timer_list { unsigned long expires; spinlock_t lock; - unsigned long magic; void (*function)(unsigned long); unsigned long data; - struct tvec_t_base_s *base; + struct tvec_t_base_s *base; /* ==NULL if timer not scheduled */ + u8 running; /* function is currently executing */ + u8 shutdown;/* do not schedule this timer */ }; -#define TIMER_MAGIC0x4b87ad6e - #define TIMER_INITIALIZER(_function, _expires, _data) {\ .function = (_function),\ .expires = (_expires), \ .data = (_data),\ .base = NULL, \ - .magic = TIMER_MAGIC, \ + .running = 0, \ + .shutdown = 0, \ .lock = SPIN_LOCK_UNLOCKED, \ } @@ -42,7 +42,8 @@ struct timer_list { static inline void init_timer(struct timer_list * timer) { timer->base = NULL; - timer->magic = TIMER_MAGIC; + timer->running = 0; + timer->shutdown = 0; spin_lock_init(>lock); } Index: linux-2.6.11/kernel/timer.c === --- linux-2.6.11.orig/kernel/timer.c2005-03-14 14:17:51.671533904 -0800 +++ linux-2.6.11/kernel/timer.c 2005-03-14 14:57:24.613791848 -0800 @@ -89,31 +89,6 @@ static inline void set_running_timer(tve /* Fake initialization */ static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED }; -static void check_timer_failed(struct timer_list *timer) -{ - static int whine_count; - if (whine_count < 16) { - whine_count++; - printk("Uninitialised timer!\n"); - printk("This is just a warning. Your computer is OK\n"); - printk("function=0x%p, data=0x%lx\n", - timer->function, timer->data); - dump_stack(); - } - /* -* Now fix it up -*/ - spin_lock_init(>lock); - timer->magic = TIMER_MAGIC; -} - -static inline void check_timer(struct timer_list *timer) -{ - if (timer->magic != TIMER_MAGIC) - check_timer_failed(timer); -} - - static void internal_add_timer(tvec_base_t *base, struct timer_list *timer) { unsigned long expires = timer->expires; @@ -164,8 +139,6 @@ int __mod_timer(struct timer_list *timer BUG_ON(!timer->function); - check_timer(timer); - spin_lock_irqsave(>lock, flags); new_base = &__get_cpu_var(tvec_bases); repeat: @@ -208,8 +181,10 @@ repeat: ret = 1; } timer->expires = expires; - internal_add_timer(new_base, timer); - timer->base = new_base; + if (!timer->shutdown) { + internal_add_timer(new_base, timer); + timer->base = new_base; + } if (old_base && (new_base != old_base)) spin_unlock(_base->lock); @@ -235,7 +210,8 @@ void add_timer_on(struct timer_list *tim BUG_ON(timer_pending(timer) || !timer->function); - check_timer(timer); + if (timer->shutdown) + return; spin_lock_irqsave(>lock, flags); internal_add_timer(base, timer); @@ -267,8 +243,6 @@ int mod_timer(struct timer_list *timer, { BUG_ON(!timer->function); - check_timer(timer); - /* * This is a common optimization triggered by the * networking code - if the timer is re-modified @@ -298,8 +272,6 @@ int del_timer(struct timer_list *timer) unsigned long flags; tvec_base_t *base; - check_timer(timer); - repeat: base = timer->base; if
Re: [patch] del_timer_sync scalability patch
How about this take on the problem? When a potential periodic timer is deleted through timer_del_sync, all cpus are scanned to determine if the timer is running on that cpu. In a NUMA configuration doing so will cause NUMA interlink traffic which limits the scalability of timers. The following patch removes the magic in the timer_list structure (Andrew suggested that we may not need it anymore) and replaces it with two u8 variables that give us some additional state of the timer running - Set if the timer function is currently executing shutdown- Set if del_timer_sync is running and the timer should not be rescheduled. Signed-off-by: Christoph Lameter [EMAIL PROTECTED] Signed-off-by: Shai Fultheim [EMAIL PROTECTED] Index: linux-2.6.11/include/linux/timer.h === --- linux-2.6.11.orig/include/linux/timer.h 2005-03-14 14:17:51.671533904 -0800 +++ linux-2.6.11/include/linux/timer.h 2005-03-14 14:41:49.640929328 -0800 @@ -13,22 +13,22 @@ struct timer_list { unsigned long expires; spinlock_t lock; - unsigned long magic; void (*function)(unsigned long); unsigned long data; - struct tvec_t_base_s *base; + struct tvec_t_base_s *base; /* ==NULL if timer not scheduled */ + u8 running; /* function is currently executing */ + u8 shutdown;/* do not schedule this timer */ }; -#define TIMER_MAGIC0x4b87ad6e - #define TIMER_INITIALIZER(_function, _expires, _data) {\ .function = (_function),\ .expires = (_expires), \ .data = (_data),\ .base = NULL, \ - .magic = TIMER_MAGIC, \ + .running = 0, \ + .shutdown = 0, \ .lock = SPIN_LOCK_UNLOCKED, \ } @@ -42,7 +42,8 @@ struct timer_list { static inline void init_timer(struct timer_list * timer) { timer-base = NULL; - timer-magic = TIMER_MAGIC; + timer-running = 0; + timer-shutdown = 0; spin_lock_init(timer-lock); } Index: linux-2.6.11/kernel/timer.c === --- linux-2.6.11.orig/kernel/timer.c2005-03-14 14:17:51.671533904 -0800 +++ linux-2.6.11/kernel/timer.c 2005-03-14 14:57:24.613791848 -0800 @@ -89,31 +89,6 @@ static inline void set_running_timer(tve /* Fake initialization */ static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED }; -static void check_timer_failed(struct timer_list *timer) -{ - static int whine_count; - if (whine_count 16) { - whine_count++; - printk(Uninitialised timer!\n); - printk(This is just a warning. Your computer is OK\n); - printk(function=0x%p, data=0x%lx\n, - timer-function, timer-data); - dump_stack(); - } - /* -* Now fix it up -*/ - spin_lock_init(timer-lock); - timer-magic = TIMER_MAGIC; -} - -static inline void check_timer(struct timer_list *timer) -{ - if (timer-magic != TIMER_MAGIC) - check_timer_failed(timer); -} - - static void internal_add_timer(tvec_base_t *base, struct timer_list *timer) { unsigned long expires = timer-expires; @@ -164,8 +139,6 @@ int __mod_timer(struct timer_list *timer BUG_ON(!timer-function); - check_timer(timer); - spin_lock_irqsave(timer-lock, flags); new_base = __get_cpu_var(tvec_bases); repeat: @@ -208,8 +181,10 @@ repeat: ret = 1; } timer-expires = expires; - internal_add_timer(new_base, timer); - timer-base = new_base; + if (!timer-shutdown) { + internal_add_timer(new_base, timer); + timer-base = new_base; + } if (old_base (new_base != old_base)) spin_unlock(old_base-lock); @@ -235,7 +210,8 @@ void add_timer_on(struct timer_list *tim BUG_ON(timer_pending(timer) || !timer-function); - check_timer(timer); + if (timer-shutdown) + return; spin_lock_irqsave(base-lock, flags); internal_add_timer(base, timer); @@ -267,8 +243,6 @@ int mod_timer(struct timer_list *timer, { BUG_ON(!timer-function); - check_timer(timer); - /* * This is a common optimization triggered by the * networking code - if the timer is re-modified @@ -298,8 +272,6 @@ int del_timer(struct timer_list *timer) unsigned long flags; tvec_base_t *base; - check_timer(timer); - repeat: base = timer-base; if (!base) @@ -337,35
Re: [patch] del_timer_sync scalability patch
Christoph Lameter wrote: On Sun, 13 Mar 2005, Oleg Nesterov wrote: I suspect that del_timer_sync() in its current form is racy. ...snip... next timer interrupt, __run_timers() picks this timer again, sets timer-base = NULL ^^^ if (timer_pending(timer)) // no, timer-base == NULL timer-base is != NULL because the timer has rescheduled itself. __mod_timer sets timer-bBase Christoph, please look again. Yes, __mod_timer sets timer-base, but it is cleared in the _next_ timer interrupt on CPU 0. Andrew, Ingo, what do you think? Oleg. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Christoph Lameter wrote: @@ -476,6 +454,7 @@ repeat: } } spin_lock_irq(base-lock); + timer-running = 0; ^^ goto repeat; } } This is imho wrong. The timer probably don't exist when timer_list-function returns. I mean, timer_list-function could deallocate timer's memory. Oleg. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
* Christoph Lameter [EMAIL PROTECTED] wrote: The following patch removes the magic in the timer_list structure (Andrew suggested that we may not need it anymore) and replaces it with two u8 variables that give us some additional state of the timer The 'remove the magic' observation is not a 'backdoor' to introduce new fields 'for free'. So please dont mix the two things into one patch. + u8 running; /* function is currently executing */ + u8 shutdown;/* do not schedule this timer */ it may as well be cleaner to do the timer-base_running thing after all, and to do this in __run_timers(): timer-base = NULL; timer-base_running = base; note that we cannot clear -base_running in __run_timers() [we dont own any access to the timer anymore, it may be kfree()d, etc.], so del_timer_sync() has to make sure that the timer-base_running info is not stale - it is a hint only. But it looks doable, and it should solve the NUMA scalability problem. Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
On Sun, 13 Mar 2005, Oleg Nesterov wrote: > I suspect that del_timer_sync() in its current form is racy. > > CPU 0 CPU 1 > > __run_timers() sets timer->base = NULL > > del_timer_sync() starts, calls > del_timer(), it returns > because timer->base == NULL. > > waits until the run is complete: > while > (base->running_timer == timer) > > preempt_check_resched(); > > calls schedule(), or long > interrupt happens. > > timer reschedules itself, __run_timers() > exits. > > base->running_timer == NULL, > end of loop. > > next timer interrupt, __run_timers() picks > this timer again, sets timer->base = NULL > > if (timer_pending(timer)) > // no, timer->base == NULL timer->base is != NULL because the timer has rescheduled itself. __mod_timer sets timer->bBase > goto del_again; > // not taken Yes it is taken! > > del_timer_sync() returns > > timer runs. Nope. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
On Sun, 13 Mar 2005, Oleg Nesterov wrote: I suspect that del_timer_sync() in its current form is racy. CPU 0 CPU 1 __run_timers() sets timer-base = NULL del_timer_sync() starts, calls del_timer(), it returns because timer-base == NULL. waits until the run is complete: while (base-running_timer == timer) preempt_check_resched(); calls schedule(), or long interrupt happens. timer reschedules itself, __run_timers() exits. base-running_timer == NULL, end of loop. next timer interrupt, __run_timers() picks this timer again, sets timer-base = NULL if (timer_pending(timer)) // no, timer-base == NULL timer-base is != NULL because the timer has rescheduled itself. __mod_timer sets timer-bBase goto del_again; // not taken Yes it is taken! del_timer_sync() returns timer runs. Nope. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
I suspect that del_timer_sync() in its current form is racy. CPU 0 CPU 1 __run_timers() sets timer->base = NULL del_timer_sync() starts, calls del_timer(), it returns because timer->base == NULL. waits until the run is complete: while (base->running_timer == timer) preempt_check_resched(); calls schedule(), or long interrupt happens. timer reschedules itself, __run_timers() exits. base->running_timer == NULL, end of loop. next timer interrupt, __run_timers() picks this timer again, sets timer->base = NULL if (timer_pending(timer)) // no, timer->base == NULL goto del_again; // not taken del_timer_sync() returns timer runs. No? Oleg. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
I suspect that del_timer_sync() in its current form is racy. CPU 0 CPU 1 __run_timers() sets timer-base = NULL del_timer_sync() starts, calls del_timer(), it returns because timer-base == NULL. waits until the run is complete: while (base-running_timer == timer) preempt_check_resched(); calls schedule(), or long interrupt happens. timer reschedules itself, __run_timers() exits. base-running_timer == NULL, end of loop. next timer interrupt, __run_timers() picks this timer again, sets timer-base = NULL if (timer_pending(timer)) // no, timer-base == NULL goto del_again; // not taken del_timer_sync() returns timer runs. No? Oleg. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
On Fri, 11 Mar 2005, Oleg Nesterov wrote: > I think it is not enough to exchange these 2 lines in > __run_timers, we also need barriers. Maybe its best to drop last_running_timer as Ingo suggested. Replace the magic with a flag that can be set to stop scheduling a timer again. Then del_timer_sync may 1. Set the flag not to reschedule 2. delete the timer 3. wait till the timer function is complete 4. (eventually verify that the timer is really gone) 5. reset the no reschedule flag - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Hello. I am not sure, but I think this patch incorrect. > @@ -466,6 +482,7 @@ repeat: > set_running_timer(base, timer); > smp_wmb(); > timer->base = NULL; --> WINDOW <-- > + set_last_running(timer, base); > spin_unlock_irq(>lock); Suppose it is the first invocation of timer, so timer->last_running == NULL. What if del_timer_sync() happens in that window? del_timer_sync: del_timer();// timer->base == NULL, returns base = timer->last_running; if (base) // no, it is still NULL ... if (timer->base != NULL || timer->last_running != base) goto del_again; // not taken return; I think it is not enough to exchange these 2 lines in __run_timers, we also need barriers. Oleg. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Hello. I am not sure, but I think this patch incorrect. @@ -466,6 +482,7 @@ repeat: set_running_timer(base, timer); smp_wmb(); timer-base = NULL; -- WINDOW -- + set_last_running(timer, base); spin_unlock_irq(base-lock); Suppose it is the first invocation of timer, so timer-last_running == NULL. What if del_timer_sync() happens in that window? del_timer_sync: del_timer();// timer-base == NULL, returns base = timer-last_running; if (base) // no, it is still NULL ... if (timer-base != NULL || timer-last_running != base) goto del_again; // not taken return; I think it is not enough to exchange these 2 lines in __run_timers, we also need barriers. Oleg. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
On Fri, 11 Mar 2005, Oleg Nesterov wrote: I think it is not enough to exchange these 2 lines in __run_timers, we also need barriers. Maybe its best to drop last_running_timer as Ingo suggested. Replace the magic with a flag that can be set to stop scheduling a timer again. Then del_timer_sync may 1. Set the flag not to reschedule 2. delete the timer 3. wait till the timer function is complete 4. (eventually verify that the timer is really gone) 5. reset the no reschedule flag - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
On Tue, 8 Mar 2005, Ingo Molnar wrote: > > > The following patch makes the timer remember where the timer was last > > > started. It is then possible to only wait for the completion of the timer > > > on that specific cpu. > > i'm not sure about this. The patch adds one more pointer to a very > frequently used and frequently embedded data structure (struct > timer_list), for the benefit of a rarely used API variant > (timer_del_sync()). Without that pointer there is no information in the timer_list struct as to where the timer is/was running since base is set to NULL when the timer function is executed. > Furthermore, timer->base itself is affine too, a timer always runs on > the CPU belonging to timer->base. So a more scalable variant of > del_timer_sync() would perhaps be possible by carefully deleting the > timer and only going into the full loop if the timer was deleted before. > (and in which case semantics require us to synchronize on timer > execution.) Or we could skip the full loop altogether and just > synchronize with timer execution if _we_ deleted the timer. This should > work fine as far as itimers are concerned. There is no easy way to distinguish the case that a timer was deleted from the case that the timer function is running (which may reschedule the periodic timer). A scan over all the bases is required to insure that the timer function is not being executed right now. If we only would sync when we deleted the timer (meaning timer->base was != NULL) then we would not synchronize when the timer function is executing (and therefore timer->base == NULL). However, we need to sync exactly in that case because a periodic timer may set timer->base again when scheduling the next event. That is the main stated purpose of del_timer_sync. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
On Tue, 8 Mar 2005, Andrew Morton wrote: > If we're prepared to rule that a timer handler is not allowed to do > add_timer_on() then a recurring timer is permanently pinned to a CPU, isn't > it? The process may be rescheduled to run on different processor. Then the add_timer() function (called from schedule_next_timer in kernel/posix-timers.c) will also add the timer to the new processor because it is called from the signal handling code. So I think that it is possible that a periodic timer will be scheduled on different processors. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Ingo Molnar <[EMAIL PROTECTED]> wrote: > > > * Andrew Morton <[EMAIL PROTECTED]> wrote: > > > Christoph Lameter <[EMAIL PROTECTED]> wrote: > > > > > > When a potential periodic timer is deleted through timer_del_sync, all > > > cpus are scanned to determine if the timer is running on that cpu. In a > > > NUMA configuration doing so will cause NUMA interlink traffic which > > > limits > > > the scalability of timers. > > > > > > The following patch makes the timer remember where the timer was last > > > started. It is then possible to only wait for the completion of the timer > > > on that specific cpu. > > i'm not sure about this. The patch adds one more pointer to a very > frequently used and frequently embedded data structure (struct > timer_list), for the benefit of a rarely used API variant > (timer_del_sync()). True. (We can delete timer.magic and check_timer() now. It has served its purpose) > Furthermore, timer->base itself is affine too, a timer always runs on > the CPU belonging to timer->base. So a more scalable variant of > del_timer_sync() would perhaps be possible by carefully deleting the > timer and only going into the full loop if the timer was deleted before. > (and in which case semantics require us to synchronize on timer > execution.) Or we could skip the full loop altogether and just > synchronize with timer execution if _we_ deleted the timer. This should > work fine as far as itimers are concerned. If we're prepared to rule that a timer handler is not allowed to do add_timer_on() then a recurring timer is permanently pinned to a CPU, isn't it? That should make things simpler? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
* Andrew Morton <[EMAIL PROTECTED]> wrote: > Christoph Lameter <[EMAIL PROTECTED]> wrote: > > > > When a potential periodic timer is deleted through timer_del_sync, all > > cpus are scanned to determine if the timer is running on that cpu. In a > > NUMA configuration doing so will cause NUMA interlink traffic which limits > > the scalability of timers. > > > > The following patch makes the timer remember where the timer was last > > started. It is then possible to only wait for the completion of the timer > > on that specific cpu. i'm not sure about this. The patch adds one more pointer to a very frequently used and frequently embedded data structure (struct timer_list), for the benefit of a rarely used API variant (timer_del_sync()). Furthermore, timer->base itself is affine too, a timer always runs on the CPU belonging to timer->base. So a more scalable variant of del_timer_sync() would perhaps be possible by carefully deleting the timer and only going into the full loop if the timer was deleted before. (and in which case semantics require us to synchronize on timer execution.) Or we could skip the full loop altogether and just synchronize with timer execution if _we_ deleted the timer. This should work fine as far as itimers are concerned. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
* Andrew Morton [EMAIL PROTECTED] wrote: Christoph Lameter [EMAIL PROTECTED] wrote: When a potential periodic timer is deleted through timer_del_sync, all cpus are scanned to determine if the timer is running on that cpu. In a NUMA configuration doing so will cause NUMA interlink traffic which limits the scalability of timers. The following patch makes the timer remember where the timer was last started. It is then possible to only wait for the completion of the timer on that specific cpu. i'm not sure about this. The patch adds one more pointer to a very frequently used and frequently embedded data structure (struct timer_list), for the benefit of a rarely used API variant (timer_del_sync()). Furthermore, timer-base itself is affine too, a timer always runs on the CPU belonging to timer-base. So a more scalable variant of del_timer_sync() would perhaps be possible by carefully deleting the timer and only going into the full loop if the timer was deleted before. (and in which case semantics require us to synchronize on timer execution.) Or we could skip the full loop altogether and just synchronize with timer execution if _we_ deleted the timer. This should work fine as far as itimers are concerned. Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Ingo Molnar [EMAIL PROTECTED] wrote: * Andrew Morton [EMAIL PROTECTED] wrote: Christoph Lameter [EMAIL PROTECTED] wrote: When a potential periodic timer is deleted through timer_del_sync, all cpus are scanned to determine if the timer is running on that cpu. In a NUMA configuration doing so will cause NUMA interlink traffic which limits the scalability of timers. The following patch makes the timer remember where the timer was last started. It is then possible to only wait for the completion of the timer on that specific cpu. i'm not sure about this. The patch adds one more pointer to a very frequently used and frequently embedded data structure (struct timer_list), for the benefit of a rarely used API variant (timer_del_sync()). True. (We can delete timer.magic and check_timer() now. It has served its purpose) Furthermore, timer-base itself is affine too, a timer always runs on the CPU belonging to timer-base. So a more scalable variant of del_timer_sync() would perhaps be possible by carefully deleting the timer and only going into the full loop if the timer was deleted before. (and in which case semantics require us to synchronize on timer execution.) Or we could skip the full loop altogether and just synchronize with timer execution if _we_ deleted the timer. This should work fine as far as itimers are concerned. If we're prepared to rule that a timer handler is not allowed to do add_timer_on() then a recurring timer is permanently pinned to a CPU, isn't it? That should make things simpler? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
On Tue, 8 Mar 2005, Andrew Morton wrote: If we're prepared to rule that a timer handler is not allowed to do add_timer_on() then a recurring timer is permanently pinned to a CPU, isn't it? The process may be rescheduled to run on different processor. Then the add_timer() function (called from schedule_next_timer in kernel/posix-timers.c) will also add the timer to the new processor because it is called from the signal handling code. So I think that it is possible that a periodic timer will be scheduled on different processors. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
On Tue, 8 Mar 2005, Ingo Molnar wrote: The following patch makes the timer remember where the timer was last started. It is then possible to only wait for the completion of the timer on that specific cpu. i'm not sure about this. The patch adds one more pointer to a very frequently used and frequently embedded data structure (struct timer_list), for the benefit of a rarely used API variant (timer_del_sync()). Without that pointer there is no information in the timer_list struct as to where the timer is/was running since base is set to NULL when the timer function is executed. Furthermore, timer-base itself is affine too, a timer always runs on the CPU belonging to timer-base. So a more scalable variant of del_timer_sync() would perhaps be possible by carefully deleting the timer and only going into the full loop if the timer was deleted before. (and in which case semantics require us to synchronize on timer execution.) Or we could skip the full loop altogether and just synchronize with timer execution if _we_ deleted the timer. This should work fine as far as itimers are concerned. There is no easy way to distinguish the case that a timer was deleted from the case that the timer function is running (which may reschedule the periodic timer). A scan over all the bases is required to insure that the timer function is not being executed right now. If we only would sync when we deleted the timer (meaning timer-base was != NULL) then we would not synchronize when the timer function is executing (and therefore timer-base == NULL). However, we need to sync exactly in that case because a periodic timer may set timer-base again when scheduling the next event. That is the main stated purpose of del_timer_sync. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Christoph Lameter <[EMAIL PROTECTED]> wrote: > > When a potential periodic timer is deleted through timer_del_sync, all > cpus are scanned to determine if the timer is running on that cpu. In a > NUMA configuration doing so will cause NUMA interlink traffic which limits > the scalability of timers. > > The following patch makes the timer remember where the timer was last > started. It is then possible to only wait for the completion of the timer > on that specific cpu. OK, I stared at this for a while and cannot see holes in it. There may still be one though - this code is tricky. Probably any such holes would be covered up by the fact that timers never migrate between CPUs anyway, unless the handler chooses to do add_timer_on(), which I'm sure none do.. Ingo, could you take a look? Especially what happens if the timer hops CPUs after the /* Get where the timer ran last */ base = timer->last_running; statement. Do we have sufficient barriers in there for this? A few tidyings which I'd suggest: - Remove TIMER_INIT_LASTRUNNING: struct initialisers set unmentioned fields to zero anyway. - Rename set_last_running() to timer_set_last_running, move it to timer.h. Then use it in init_timer() to avoid an ifdef. Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> --- 25-akpm/include/linux/timer.h | 21 ++--- 25-akpm/kernel/timer.c| 10 +- 2 files changed, 11 insertions(+), 20 deletions(-) diff -puN include/linux/timer.h~del_timer_sync-scalability-patch-tidy include/linux/timer.h --- 25/include/linux/timer.h~del_timer_sync-scalability-patch-tidy 2005-03-07 23:28:55.0 -0800 +++ 25-akpm/include/linux/timer.h 2005-03-07 23:30:23.0 -0800 @@ -26,12 +26,6 @@ struct timer_list { #define TIMER_MAGIC0x4b87ad6e -#ifdef CONFIG_SMP -#define TIMER_INIT_LASTRUNNING .last_running = NULL, -#else -#define TIMER_INIT_LASTRUNNING -#endif - #define TIMER_INITIALIZER(_function, _expires, _data) {\ .function = (_function),\ .expires = (_expires), \ @@ -39,9 +33,16 @@ struct timer_list { .base = NULL, \ .magic = TIMER_MAGIC, \ .lock = SPIN_LOCK_UNLOCKED, \ - TIMER_INIT_LASTRUNNING \ } +static inline void +timer_set_last_running(struct timer_list *timer, struct tvec_t_base_s *base) +{ +#ifdef CONFIG_SMP + timer->last_running = base; +#endif +} + /*** * init_timer - initialize a timer. * @timer: the timer to be initialized @@ -49,11 +50,9 @@ struct timer_list { * init_timer() must be done to a timer prior calling *any* of the * other timer functions. */ -static inline void init_timer(struct timer_list * timer) +static inline void init_timer(struct timer_list *timer) { -#ifdef CONFIG_SMP - timer->last_running = NULL; -#endif + timer_set_last_running(timer, NULL); timer->base = NULL; timer->magic = TIMER_MAGIC; spin_lock_init(>lock); diff -puN kernel/timer.c~del_timer_sync-scalability-patch-tidy kernel/timer.c --- 25/kernel/timer.c~del_timer_sync-scalability-patch-tidy 2005-03-07 23:28:55.0 -0800 +++ 25-akpm/kernel/timer.c 2005-03-07 23:28:55.0 -0800 @@ -86,14 +86,6 @@ static inline void set_running_timer(tve #endif } -static inline void set_last_running(struct timer_list *timer, - tvec_base_t *base) -{ -#ifdef CONFIG_SMP - timer->last_running = base; -#endif -} - /* Fake initialization */ static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED }; @@ -482,7 +474,7 @@ repeat: set_running_timer(base, timer); smp_wmb(); timer->base = NULL; - set_last_running(timer, base); + timer_set_last_running(timer, base); spin_unlock_irq(>lock); { u32 preempt_count = preempt_count(); _ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] del_timer_sync scalability patch
Christoph Lameter [EMAIL PROTECTED] wrote: When a potential periodic timer is deleted through timer_del_sync, all cpus are scanned to determine if the timer is running on that cpu. In a NUMA configuration doing so will cause NUMA interlink traffic which limits the scalability of timers. The following patch makes the timer remember where the timer was last started. It is then possible to only wait for the completion of the timer on that specific cpu. OK, I stared at this for a while and cannot see holes in it. There may still be one though - this code is tricky. Probably any such holes would be covered up by the fact that timers never migrate between CPUs anyway, unless the handler chooses to do add_timer_on(), which I'm sure none do.. Ingo, could you take a look? Especially what happens if the timer hops CPUs after the /* Get where the timer ran last */ base = timer-last_running; statement. Do we have sufficient barriers in there for this? A few tidyings which I'd suggest: - Remove TIMER_INIT_LASTRUNNING: struct initialisers set unmentioned fields to zero anyway. - Rename set_last_running() to timer_set_last_running, move it to timer.h. Then use it in init_timer() to avoid an ifdef. Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- 25-akpm/include/linux/timer.h | 21 ++--- 25-akpm/kernel/timer.c| 10 +- 2 files changed, 11 insertions(+), 20 deletions(-) diff -puN include/linux/timer.h~del_timer_sync-scalability-patch-tidy include/linux/timer.h --- 25/include/linux/timer.h~del_timer_sync-scalability-patch-tidy 2005-03-07 23:28:55.0 -0800 +++ 25-akpm/include/linux/timer.h 2005-03-07 23:30:23.0 -0800 @@ -26,12 +26,6 @@ struct timer_list { #define TIMER_MAGIC0x4b87ad6e -#ifdef CONFIG_SMP -#define TIMER_INIT_LASTRUNNING .last_running = NULL, -#else -#define TIMER_INIT_LASTRUNNING -#endif - #define TIMER_INITIALIZER(_function, _expires, _data) {\ .function = (_function),\ .expires = (_expires), \ @@ -39,9 +33,16 @@ struct timer_list { .base = NULL, \ .magic = TIMER_MAGIC, \ .lock = SPIN_LOCK_UNLOCKED, \ - TIMER_INIT_LASTRUNNING \ } +static inline void +timer_set_last_running(struct timer_list *timer, struct tvec_t_base_s *base) +{ +#ifdef CONFIG_SMP + timer-last_running = base; +#endif +} + /*** * init_timer - initialize a timer. * @timer: the timer to be initialized @@ -49,11 +50,9 @@ struct timer_list { * init_timer() must be done to a timer prior calling *any* of the * other timer functions. */ -static inline void init_timer(struct timer_list * timer) +static inline void init_timer(struct timer_list *timer) { -#ifdef CONFIG_SMP - timer-last_running = NULL; -#endif + timer_set_last_running(timer, NULL); timer-base = NULL; timer-magic = TIMER_MAGIC; spin_lock_init(timer-lock); diff -puN kernel/timer.c~del_timer_sync-scalability-patch-tidy kernel/timer.c --- 25/kernel/timer.c~del_timer_sync-scalability-patch-tidy 2005-03-07 23:28:55.0 -0800 +++ 25-akpm/kernel/timer.c 2005-03-07 23:28:55.0 -0800 @@ -86,14 +86,6 @@ static inline void set_running_timer(tve #endif } -static inline void set_last_running(struct timer_list *timer, - tvec_base_t *base) -{ -#ifdef CONFIG_SMP - timer-last_running = base; -#endif -} - /* Fake initialization */ static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED }; @@ -482,7 +474,7 @@ repeat: set_running_timer(base, timer); smp_wmb(); timer-base = NULL; - set_last_running(timer, base); + timer_set_last_running(timer, base); spin_unlock_irq(base-lock); { u32 preempt_count = preempt_count(); _ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/