Re: reschedule_idle changes in ac kernels

2001-06-04 Thread Nigel Gamble

On Mon, 4 Jun 2001, Mike Kravetz wrote:
> I just noticed the changes to reschedule_idle() in the 2.4.5-ac
> kernel.  I suspect these are the changes made for:
> 
> o   Fix off by one on real time pre-emption in scheduler
> 
> I'm curious if anyone has run any benchmarks before and after
> applying this fix.

I was running realtime benchmarks, which was how I found the bug.

> The reason I ask is that during the development of my multi-queue
> scheduler, I 'accidentally' changed reschedule_idle code to trigger
> a preemption if preemption_goodness() was greater than 0, as
> opposed to greater than 1.  I believe this is the same change made
> to the ac kernel.  After this change, we saw a noticeable drop in
> performance for some benchmarks.
> 
> The drop in performance I saw could have been the result of a
> combination of the change, and my multi-queue scheduler.  However,
> in any case aren't we now going to trigger more preemptions?
> 
> I understand that we need to make the fix to get the realtime
> semantics correct, but we also need to be aware of performance in
> the non-realtime case.

The realtime bug was caused by whoever decided, sometime in 2.4, that
the result of preemption_goodness() should be compared to 1 instead of 0
(without changing the comment above that function).

An alternative fix for the realtime bug would be 

weight = 1000 + (p->rt_priority * 2);

in goodness(), so that two realtime tasks with priorities that differ by
1 would have goodness values that differ by more than one.

However, before anyone rushes to implement this, I'd like to suggest
that any performance problems that may be found with the SCHED_OTHER
goodness calculation should be fixed in goodness(), if at all possible,
and not leak out as an undocumented magic number into reschedule_idle().

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] x86 page fault handler not interrupt safe

2001-05-07 Thread Nigel Gamble

On Mon, 7 May 2001, Brian Gerst wrote:
> Nigel Gamble wrote:
> > 
> > On Mon, 7 May 2001, Linus Torvalds wrote:
> > > On Mon, 7 May 2001, Brian Gerst wrote:
> > > > This patch will still cause the user process to seg fault: The error
> > > > code on the stack will not match the address in %cr2.
> > >
> > > You've convinced me. Good thinking. Let's do the irq thing.
> > 
> > I've actually seen user processes seg faulting because of this with the
> > fully preemptible kernel patch applied.  The fix we used in that patch
> > was to use an interrupt gate for the fault handler, then to simply
> > restore the interrupt state:
> 
> Keep in mind that regs->eflags could be from user space, and could have
> some undesirable flags set.  That's why I did a test/sti instead of

Good point.

> reloading eflags.  Plus my patch leaves interrupts disabled for the
> minimum time possible.

I'm not sure that it makes much difference, as interrupts are disabled
for such a short time anyway.  I'd prefer to put the test/sti in
do_page_fault(), and reduce the complexity needed in assembler routines
as much as possible, for maintainability reasons.

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [PATCH] x86 page fault handler not interrupt safe

2001-05-07 Thread Nigel Gamble

On Mon, 7 May 2001, Linus Torvalds wrote:
> On Mon, 7 May 2001, Brian Gerst wrote:
> > This patch will still cause the user process to seg fault: The error
> > code on the stack will not match the address in %cr2.
> 
> You've convinced me. Good thinking. Let's do the irq thing.

I've actually seen user processes seg faulting because of this with the
fully preemptible kernel patch applied.  The fix we used in that patch
was to use an interrupt gate for the fault handler, then to simply
restore the interrupt state:

diff -Nur 2.4.2/arch/i386/kernel/traps.c linux/arch/i386/kernel/traps.c
--- 2.4.2/arch/i386/kernel/traps.c  Mon Mar 26 18:41:05 2001
+++ linux/arch/i386/kernel/traps.c  Tue Mar 27 15:13:33 2001
@@ -973,7 +973,7 @@
set_trap_gate(11,&segment_not_present);
set_trap_gate(12,&stack_segment);
set_trap_gate(13,&general_protection);
-   set_trap_gate(14,&page_fault);
+   set_intr_gate(14,&page_fault);
set_trap_gate(15,&spurious_interrupt_bug);
set_trap_gate(16,&coprocessor_error);
set_trap_gate(17,&alignment_check);
diff -Nur 2.4.2/arch/i386/mm/fault.c linux/arch/i386/mm/fault.c
--- 2.4.2/arch/i386/mm/fault.c  Mon Mar 26 18:41:06 2001
+++ linux/arch/i386/mm/fault.c  Tue Mar 27 15:13:33 2001
@@ -117,6 +117,9 @@
/* get the address */
__asm__("movl %%cr2,%0":"=r" (address));
 
+   /* It's safe to allow preemption after cr2 has been saved */
+   local_irq_restore(regs->eflags);
+
    tsk = current;
 
/*

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: #define HZ 1024 -- negative effects?

2001-04-27 Thread Nigel Gamble

On Fri, 27 Apr 2001, Mike Galbraith wrote:
> On Fri, 27 Apr 2001, Nigel Gamble wrote:
> > > What about SCHED_YIELD and allocating during vm stress times?
> 
> snip
> 
> > A well-written GUI should not be using SCHED_YIELD.  If it is
> 
> I was referring to the gui (or other tasks) allocating memory during
> vm stress periods, and running into the yield in __alloc_pages()..
> not a voluntary yield.

Oh, I see.  Well, if this were causing the problem, then running the GUI
at a real-time priority would be a better solution than increasing the
clock frequency, since SCHED_YIELD has no effect on real-time tasks
unless there are other runnable real-time tasks at the same priority.
The call to schedule() would just reschedule the real-time GUI task
itself immediately.

However, in times of vm stress it is more likely that GUI performance
problems would be caused by parts of the GUI having been paged out,
rather than by anything which could be helped by scheduling differences.

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: #define HZ 1024 -- negative effects?

2001-04-27 Thread Nigel Gamble

On Fri, 27 Apr 2001, Mike Galbraith wrote:
> > Rubbish.  Whenever a higher-priority thread than the current
> > thread becomes runnable the current thread will get preempted,
> > regardless of whether its timeslice is over or not.
> 
> What about SCHED_YIELD and allocating during vm stress times?
> 
> Say you have only two tasks.  One is the gui and is allocating,
> the other is a pure compute task.  The compute task doesn't do
> anything which will cause preemption except use up its slice.
> The gui may yield the cpu but the compute job never will.
> 
> (The gui won't _become_ runnable if that matters.  It's marked
> as running, has yielded its remaining slice and gone to sleep..
> with its eyes open;)

A well-written GUI should not be using SCHED_YIELD.  If it is
"allocating" anything, it won't be using SCHED_YIELD or be marked
runnable; it will be blocked, waiting until the resource becomes
available.  When that happens, it will preempt the compute task (if its
priority is high enough, which is very likely - and can be assured if
it's running at a real-time priority as I suggested earlier).

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Viewing SCHED_FIFO, SCHED_RR stats in /proc

2001-04-26 Thread Nigel Gamble

I've just noticed that the priority and nice values listed in
/proc/<pid>/stat aren't very useful for SCHED_FIFO or SCHED_RR tasks.
I'd like to be able to distinguish tasks with these policies from
SCHED_OTHER tasks, and to view task->rt_priority.

Am I correct that this information is not currently available through
/proc?

Here is one way to expose this information that should be compatible
with existing tools like top and ps.  For SCHED_OTHER, the values are
unchanged.  For SCHED_RR and SCHED_FIFO, the priority value displayed is
(20 + task->rt_priority), which distinguishes them from SCHED_OTHER
priorities, which can't be greater than 20.  And SCHED_FIFO tasks, whose
nice value is ignored by the scheduler, are distinguished from SCHED_RR
tasks by being displayed with a nice value of -99.

diff -u -r1.2 array.c
--- linux/fs/proc/array.c   2001/04/16 23:26:41 1.2
+++ linux/fs/proc/array.c   2001/04/26 22:37:56
@@ -336,11 +336,18 @@
 
collect_sigign_sigcatch(task, &sigign, &sigcatch);
 
-   /* scale priority and nice values from timeslices to -20..20 */
-   /* to make it look like a "normal" Unix priority/nice value  */
-   priority = task->counter;
-   priority = 20 - (priority * 10 + DEF_COUNTER / 2) / DEF_COUNTER;
-   nice = task->nice;
+   if (task->policy == SCHED_OTHER) {
+   /* scale priority and nice values from timeslices to -20..20 */
+   /* to make it look like a "normal" Unix priority/nice value  */
+   priority = task->counter;
+   priority = 20 - (priority * 10 + DEF_COUNTER / 2) / DEF_COUNTER;
+   } else {
+   priority = 20 + task->rt_priority;
+   }
+   if (task->policy == SCHED_FIFO)
+   nice = -99;
+   else
+   nice = task->nice;
 
read_lock(&tasklist_lock);
ppid = task->p_opptr->pid;

Can anyone think of a better way of doing this?

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: #define HZ 1024 -- negative effects?

2001-04-25 Thread Nigel Gamble

On Tue, 24 Apr 2001, Michael Rothwell wrote:
> Are there any negative effects of editing include/asm/param.h to change 
> HZ from 100 to 1024? Or any other number? This has been suggested as a 
> way to improve the responsiveness of the GUI on a Linux system. Does it 
> throw off anything else, like serial port timing, etc.?

Why not just run the X server at a realtime priority?  Then it will get
to respond to existing events, such as keyboard and mouse input,
promptly without creating lots of superfluous extra clock interrupts.
I think you will find this is a better solution.

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: Scheduling bug for SCHED_FIFO and SCHED_RR

2001-04-24 Thread Nigel Gamble

On Fri, 20 Apr 2001, Nigel Gamble wrote:
> A SCHED_FIFO or SCHED_RR task with priority n+1 will not preempt a
> running task with priority n.  You need to give the higher priority task
> a priority of at least n+2 for it to be chosen by the scheduler.
> 
> The problem is caused by reschedule_idle(), uniprocessor version:
> 
>   if (preemption_goodness(tsk, p, this_cpu) > 1)
>   tsk->need_resched = 1;
> 
> For real-time scheduling to work correctly, need_resched should be set
> whenever preemption_goodness() is greater than 0, not 1.

This bug is also in the SMP version of reschedule_idle().  The
corresponding fix (against 2.4.3-ac14) is:

--- 2.4.3-ac14/kernel/sched.c   Tue Apr 24 18:40:15 2001
+++ linux/kernel/sched.cTue Apr 24 18:41:32 2001
@@ -246,7 +246,7 @@
 */
oldest_idle = (cycles_t) -1;
target_tsk = NULL;
-   max_prio = 1;
+   max_prio = 0;
 
for (i = 0; i < smp_num_cpus; i++) {
        cpu = cpu_logical_map(i);

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: 2.4.3+ sound distortion

2001-04-21 Thread Nigel Gamble

On Sat, 21 Apr 2001, Victor Julien wrote:
> I have a problem with kernels higher than 2.4.2, the sound distorts when 
> playing a song with xmms while the seti@home client runs. 2.4.2 did not have 
> this problem. I tried 2.4.3, 2.4.4-pre5 and 2.4.3-ac11. They all showed the
> same problem.

Try running xmms as root with the "Use realtime priority when available"
option checked.  If the distortion is because xmms isn't getting enough
CPU time, then running it at a realtime priority will fix it.

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Scheduling bug for SCHED_FIFO and SCHED_RR

2001-04-20 Thread Nigel Gamble

A SCHED_FIFO or SCHED_RR task with priority n+1 will not preempt a
running task with priority n.  You need to give the higher priority task
a priority of at least n+2 for it to be chosen by the scheduler.

The problem is caused by reschedule_idle(), uniprocessor version:

if (preemption_goodness(tsk, p, this_cpu) > 1)
tsk->need_resched = 1;

For real-time scheduling to work correctly, need_resched should be set
whenever preemption_goodness() is greater than 0, not 1.

Here is a patch against 2.4.3:

--- 2.4.3/kernel/sched.cThu Apr 19 15:03:21 2001
+++ linux/kernel/sched.cFri Apr 20 16:45:07 2001
@@ -290,7 +290,7 @@
struct task_struct *tsk;
 
tsk = cpu_curr(this_cpu);
-   if (preemption_goodness(tsk, p, this_cpu) > 1)
+   if (preemption_goodness(tsk, p, this_cpu) > 0)
tsk->need_resched = 1;
 #endif
 }

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel

2001-04-11 Thread Nigel Gamble

On Tue, 10 Apr 2001 [EMAIL PROTECTED] wrote:
> On Tue, Apr 10, 2001 at 09:08:16PM -0700, Paul McKenney wrote:
> > > Disabling preemption is a possible solution if the critical section is
> > > short - less than 100us - otherwise preemption latencies become a
> > > problem.
> > 
> > Seems like a reasonable restriction.  Of course, this same limit applies
> > to locks and interrupt disabling, right?
> 
> So supposing 1/2 us per update
>   lock process list
>   for every process update pgd
>   unlock process list
> 
> is ok if #processes <  200, but can cause some unspecified system failure
> due to a dependency on the 100us limit otherwise?

Only to a hard real-time system.

> And on a slower machine or with some heavy I/O possibilities 

I'm mostly interested in Linux in embedded systems, where we have a lot
of control over the overall system, such as how many processes are
running.  This makes it easier to control latencies than on a
general purpose computer.

> We have a tiny little kernel to worry about in RTLinux and it's quite 
> hard for us to keep track of all possible delays in such cases. How's this
> going to work for Linux?

The same way everything works for Linux:  with enough people around the
world interested in and working on these problems, they will be fixed.

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel

2001-04-11 Thread Nigel Gamble

On Tue, 10 Apr 2001, Paul McKenney wrote:
> > Disabling preemption is a possible solution if the critical section
> > is short - less than 100us - otherwise preemption latencies become a
> > problem.
> 
> Seems like a reasonable restriction.  Of course, this same limit
> applies to locks and interrupt disabling, right?

That's the goal I'd like to see us achieve in 2.5.  Interrupts are
already in this range (with a few notable exceptions), but there is
still the big kernel lock and a few other long held spin locks to deal
with.  So I want to make sure that any new locking scheme like the ones
under discussion play nicely with the efforts to achieve low-latency
Linux such as the preemptible kernel.

> > The implementation of synchronize_kernel() that Rusty and I
> > discussed earlier in this thread would work in other cases, such as
> > module unloading, where there was a concern that it was not
> > practical to have any sort of lock in the read-side code path and
> > the write side was not time critical.
> 
> True, but only if the synchronize_kernel() implementation is applied
> to UP kernels, also.

Yes, that is the idea.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel

2001-04-10 Thread Nigel Gamble

On Tue, 10 Apr 2001, Paul McKenney wrote:
> The algorithms we have been looking at need to have absolute guarantees
> that earlier activity has completed.  The most straightforward way to
> guarantee this is to have the critical-section activity run with preemption
> disabled.  Most of these code segments either take out locks or run
> with interrupts disabled anyway, so there is little or no degradation of
> latency in this case.  In fact, in many cases, latency would actually be
> improved due to removal of explicit locking primitives.
>
> I believe that one of the issues that pushes in this direction is the
> discovery that "synchronize_kernel()" could not be a nop in a UP kernel
> unless the read-side critical sections disable preemption (either in
> the natural course of events, or artificially if need be).  Andi or
> Rusty can correct me if I missed something in the previous exchange...
> 
> The read-side code segments are almost always quite short, and, again,
> they would almost always otherwise need to be protected by a lock of
> some sort, which would disable preemption in any event.
> 
> Thoughts?

Disabling preemption is a possible solution if the critical section is short
- less than 100us - otherwise preemption latencies become a problem.

The implementation of synchronize_kernel() that Rusty and I discussed
earlier in this thread would work in other cases, such as module
unloading, where there was a concern that it was not practical to have
any sort of lock in the read-side code path and the write side was not
time critical.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel

2001-04-09 Thread Nigel Gamble

On Mon, 9 Apr 2001 [EMAIL PROTECTED] wrote:
> As you've observed, with the approach of waiting for all pre-empted tasks
> to synchronize, the possibility of a task staying pre-empted for a long
time could affect the latency of an update/synchronize (though it's hard for
> me to judge how likely that is).

It's very unlikely on a system that doesn't already have problems with
CPU starvation because of runaway real-time tasks or interrupt handlers.

First, preemption is a comparatively rare event with a mostly
timesharing load, typically from 1% to 10% of all context switches.

Second, the scheduler should not penalize the preempted task for being
preempted, so that it should usually get to continue running as soon as
the preempting task is descheduled, which is at most one timeslice for
timesharing tasks.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/
MontaVista Software [EMAIL PROTECTED]




Re: [PATCH for 2.5] preemptible kernel

2001-04-01 Thread Nigel Gamble

On Sat, 31 Mar 2001, Rusty Russell wrote:
> > if (p->state == TASK_RUNNING ||
> > (p->state == (TASK_RUNNING|TASK_PREEMPTED))) {
> > p->flags |= PF_SYNCING;
> 
> Setting a running task's flags brings races, AFAICT, and checking
> p->state is NOT sufficient, consider wait_event(): you need p->has_cpu
> here I think.

My thought here was that if p->state is anything other than TASK_RUNNING
or TASK_RUNNING|TASK_PREEMPTED, then that task is already at a
synchronize point, so we don't need to wait for it to arrive at another
one - it will get a consistent view of the data we are protecting.
wait_event() qualifies as a synchronize point, doesn't it?  Or am I
missing something?

> The only way I can see is to have a new element in "struct
> task_struct" saying "syncing now", which is protected by the runqueue
> lock.  This looks like (and I prefer wait queues, they have such nice
> helpers):
> 
>   static DECLARE_WAIT_QUEUE_HEAD(syncing_task);
>   static DECLARE_MUTEX(synchronize_kernel_mtx);
>   static int sync_count = 0;
> 
> schedule():
>   if (!(prev->state & TASK_PREEMPTED) && prev->syncing)
>   if (--sync_count == 0) wake_up(&syncing_task);

Don't forget to reset prev->syncing.  I agree with you about wait
queues, but didn't use them here because of the problem of avoiding
deadlock on the runqueue lock, which the wait queues also use.  The
above code in schedule needs the runqueue lock to protect sync_count.


Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [PATCH for 2.5] preemptible kernel

2001-04-01 Thread Nigel Gamble

On Sat, 31 Mar 2001, george anzinger wrote:
> I think this should be:
> if (p->has_cpu || (p->state & TASK_PREEMPTED)) {
> to catch tasks that were preempted with other states.

But the other states are all part of the state change that happens at a
non-preemptive schedule() point, aren't they, so those tasks are already
safe to access the data we are protecting.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [PATCH for 2.5] preemptible kernel

2001-03-29 Thread Nigel Gamble

On Tue, 20 Mar 2001, Nigel Gamble wrote:
> On Tue, 20 Mar 2001, Rusty Russell wrote:
> > Thoughts?
> 
> Perhaps synchronize_kernel() could take the run_queue lock, mark all the
> tasks on it and count them.  Any task marked when it calls schedule()
> voluntarily (but not if it is preempted) is unmarked and the count
> decremented.  synchronize_kernel() continues until the count is zero.

Hi Rusty,

Here is an attempt at a possible version of synchronize_kernel() that
should work on a preemptible kernel.  I haven't tested it yet.


static int sync_count = 0;
static struct task_struct *syncing_task = NULL;
static DECLARE_MUTEX(synchronize_kernel_mtx);

void
synchronize_kernel()
{
	struct list_head *tmp;
	struct task_struct *p;
	unsigned long cpus_allowed, policy, rt_priority;

	/* Guard against multiple calls to this function */
	down(&synchronize_kernel_mtx);

	/* Mark all tasks on the runqueue */
	spin_lock_irq(&runqueue_lock);
	list_for_each(tmp, &runqueue_head) {
		p = list_entry(tmp, struct task_struct, run_list);
		if (p == current)
			continue;
		if (p->state == TASK_RUNNING ||
		    (p->state == (TASK_RUNNING|TASK_PREEMPTED))) {
			p->flags |= PF_SYNCING;
			sync_count++;
		}
	}
	if (sync_count == 0)
		goto out;

	syncing_task = current;
	spin_unlock_irq(&runqueue_lock);

	/*
	 * Cause a schedule on every CPU, as for a non-preemptible
	 * kernel
	 */

	/* Save current state */
	cpus_allowed = current->cpus_allowed;
	policy = current->policy;
	rt_priority = current->rt_priority;

	/* Create an unreal time task. */
	current->policy = SCHED_FIFO;
	current->rt_priority = 1001 + sys_sched_get_priority_max(SCHED_FIFO);

	/* Make us schedulable on all CPUs. */
	current->cpus_allowed = (1UL << smp_num_cpus) - 1;

	/* Eliminate current cpu, reschedule */
	while ((current->cpus_allowed &= ~(1UL << smp_processor_id())) != 0)
		schedule();

	/* Back to normal. */
	current->cpus_allowed = cpus_allowed;
	current->policy = policy;
	current->rt_priority = rt_priority;

	/*
	 * Wait, if necessary, until all preempted tasks
	 * have reached a sync point.
	 */

	spin_lock_irq(&runqueue_lock);
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (sync_count == 0)
			break;
		spin_unlock_irq(&runqueue_lock);
		schedule();
		spin_lock_irq(&runqueue_lock);
	}
	current->state = TASK_RUNNING;
	syncing_task = NULL;
out:
	spin_unlock_irq(&runqueue_lock);

	up(&synchronize_kernel_mtx);
}

And add this code to the beginning of schedule(), just after the
runqueue_lock is taken (the flags field is probably not the right
place to put the synchronize mark; and the test should be optimized for
the fast path in the same way as the other tests in schedule(), but you
get the idea):

	if ((prev->flags & PF_SYNCING) && !(prev->state & TASK_PREEMPTED)) {
		prev->flags &= ~PF_SYNCING;
		if (--sync_count == 0) {
			syncing_task->state = TASK_RUNNING;
			if (!task_on_runqueue(syncing_task))
				add_to_runqueue(syncing_task);
			syncing_task = NULL;
		}
	}

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [PATCH for 2.5] preemptible kernel

2001-03-23 Thread Nigel Gamble

On Thu, 22 Mar 2001, Rusty Russell wrote:
> Nigel's "traverse the run queue and mark the preempted" solution is
> actually pretty nice, and cheap.  Since the runqueue lock is grabbed,
> it doesn't require icky atomic ops, either.

You'd have to mark both the preempted tasks, and the tasks currently
running on each CPU (which could become preempted before reaching a
voluntary schedule point).

> Despite Nigel's initial belief that this technique is fragile, I
> believe it will become an increasingly fundamental method in the
> kernel, so (with documentation) it will become widely understood, as
> it offers scalability and efficiency.

Actually, I agree with you now that I've had a chance to think about
this some more.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




lock_kernel() usage and sync_*() functions

2001-03-21 Thread Nigel Gamble

Why is the kernel lock held around sync_supers() and sync_inodes() in
sync_old_buffers() and fsync_dev(), but not in sync_dev()?  Is it just
to serialize calls to these functions, or is there some other reason?

Since this use of the BKL is one of the causes of high preemption
latency in a preemptible kernel, I'm hoping it would be OK to replace
them with a semaphore.  Please let me know if this is not the case.

Thanks!

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [PATCH for 2.5] preemptible kernel

2001-03-21 Thread Nigel Gamble

On Wed, 21 Mar 2001, Andrew Morton wrote:
> It's a problem for uniprocessors as well.
> 
> Example:
> 
> #define current_cpu_data boot_cpu_data
> #define pgd_quicklist (current_cpu_data.pgd_quick)
> 
> extern __inline__ void free_pgd_fast(pgd_t *pgd)
> {
> *(unsigned long *)pgd = (unsigned long) pgd_quicklist;
> pgd_quicklist = (unsigned long *) pgd;
> pgtable_cache_size++;
> }
> 
> Preemption could corrupt this list.

Thanks, Andrew, for pointing this out.  I've added fixes to the patch
for this problem and the others in pgalloc.h.  If you know of any other
similar problems on uniprocessors, please let me know.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [PATCH for 2.5] preemptible kernel

2001-03-21 Thread Nigel Gamble

On Wed, 21 Mar 2001, David S. Miller wrote:
> Basically, anything which uses smp_processor_id() would need to
> be holding some lock so as to not get pre-empted.

Not necessarily.  Another solution for the smp_processor_id() case is
to ensure that the task can only be scheduled on the current CPU for the
duration that the value of smp_processor_id() is used.  Or, if the
critical region is very short, to disable interrupts on the local CPU.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [PATCH for 2.5] preemptible kernel

2001-03-20 Thread Nigel Gamble

On Wed, 21 Mar 2001, Keith Owens wrote:
> I misread the code, but the idea is still correct.  Add a preemption
> depth counter to each cpu, when you schedule and the depth is zero then
> you know that the cpu is no longer holding any references to quiesced
> structures.

A task that has been preempted is on the run queue and can be
rescheduled on a different CPU, so I can't see how a per-CPU counter
would work.  It seems to me that you would need a per run queue
counter, like the example I gave in a previous posting.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [PATCH for 2.5] preemptible kernel

2001-03-20 Thread Nigel Gamble

On Tue, 20 Mar 2001, Keith Owens wrote:
> The preemption patch only allows preemption from interrupt and only for
> a single level of preemption.  That coexists quite happily with
> synchronize_kernel() which runs in user context.  Just count user
> context schedules (preempt_count == 0), not preemptive schedules.

I'm not sure what you mean by "only for a single level of preemption."
It's possible for a preempting process to be preempted itself by a
higher priority process, and for that process to be preempted by an even
higher priority one, limited only by the number of processes waiting for
interrupt handlers to make them runnable.  This isn't very likely in
practice (kernel preemptions tend to be rare compared to normal calls to
schedule()), but it could happen in theory.

If you're looking at preempt_schedule(), note the call to ctx_sw_off()
only increments current->preempt_count for the preempted task - the
higher priority preempting task that is about to be scheduled will have
a preempt_count of 0.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [PATCH for 2.5] preemptible kernel

2001-03-20 Thread Nigel Gamble

On Tue, 20 Mar 2001, Rusty Russell wrote:
>   I can see three problems with this approach, only one of which
> is serious.
> 
> The first is code which is already SMP unsafe is now a problem for
> everyone, not just the 0.1% of SMP machines.  I consider this a good
> thing for 2.5 though.

So do I.

> The second is that there are "manual" locking schemes which are used
> in several places in the kernel which rely on non-preemptability;
> de-facto spinlocks if you will.  I consider all these uses flawed: (1)
> they are often subtly broken anyway, (2) they make reading those parts
> of the code much harder, and (3) they break when things like this are
> done.

Likewise.

> The third is that preemptivity conflicts with the naive
> quiescent-period approach proposed for module unloading in 2.5, and
> useful for several other things (eg. hotplugging CPUs).  This method
> relies on knowing that when a schedule() has occurred on every CPU, we
> know noone is holding certain references.  The simplest example is a
> single linked list: you can traverse without a lock as long as you
> don't sleep, and then someone can unlink a node, and wait for a
> schedule on every other CPU before freeing it.  The non-SMP case is a
> noop.  See synchronize_kernel() below.

So, to make sure I understand this, the code to free a node would look
like:

prev->next = node->next; /* assumed to be atomic */
synchronize_kernel();
free(node);

So that any other CPU concurrently traversing the list would see a
consistent state, either including or not including "node" before the
call to synchronize_kernel(); but after synchronize_kernel() all other
CPUs are guaranteed to see a list that no longer includes "node", so it
is now safe to free it.

It looks like there are also implicit assumptions to this approach, like
no other CPU is trying to use the same approach simultaneously to free
"prev".  So my initial reaction is that this approach is, like the
manual locking schemes you commented on above, open to being subtly
broken when people don't understand all the implicit assumptions and
subsequently invalidate them.

> This, too, is soluble, but it means that synchronize_kernel() must
> guarantee that each task which was running or preempted in kernel
> space when it was called, has been non-preemptively scheduled before
> synchronize_kernel() can exit.  Icky.

Yes, you're right.

> Thoughts?

Perhaps synchronize_kernel() could take the run_queue lock, mark all the
tasks on it and count them.  Any task marked when it calls schedule()
voluntarily (but not if it is preempted) is unmarked and the count
decremented.  synchronize_kernel() continues until the count is zero.
As you said, "Icky."
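A rough sketch of that mark-and-count scheme (an illustrative userland model; none of these names come from a real patch):

```c
#include <assert.h>

/* synchronize_kernel() marks and counts every task on the run queue;
 * a marked task is unmarked, and the count decremented, only when it
 * calls schedule() voluntarily.  The caller waits for count == 0. */
struct task { int marked; };

static int sync_pending;

static void synchronize_mark(struct task **rq, int n)
{
    /* Done under the run-queue lock in any real implementation. */
    for (int i = 0; i < n; i++) {
        rq[i]->marked = 1;
        sync_pending++;
    }
}

/* Hook in schedule(); 'voluntary' is 0 when the task was preempted. */
static void schedule_hook(struct task *t, int voluntary)
{
    if (t->marked && voluntary) {
        t->marked = 0;
        sync_pending--;
    }
}

static int synchronize_done(void) { return sync_pending == 0; }
```

A preemptive reschedule leaves the mark in place, which is exactly what makes the scheme correct for preempted tasks still holding references.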

> /* We could keep a schedule count for each CPU and make idle tasks
>schedule (some don't unless need_resched), but this scales quite
>well (eg. 64 processors, average time to wait for first schedule =
>jiffie/64.  Total time for all processors = jiffie/63 + jiffie/62...
> 
>At 1024 cpus, this is about 7.5 jiffies.  And that assumes noone
>schedules early. --RR */
> void synchronize_kernel(void)
> {
>   unsigned long cpus_allowed, policy, rt_priority;
> 
>   /* Save current state */
>   cpus_allowed = current->cpus_allowed;
>   policy = current->policy;
>   rt_priority = current->rt_priority;
> 
>   /* Create an unreal time task. */
>   current->policy = SCHED_FIFO;
>   current->rt_priority = 1001 + sys_sched_get_priority_max(SCHED_FIFO);
> 
>   /* Make us schedulable on all CPUs. */
>   current->cpus_allowed = (1UL << smp_num_cpus)-1;
> 
>   /* Eliminate current cpu, reschedule */
>   while ((current->cpus_allowed &= ~(1 << smp_processor_id())) != 0)
>   schedule();
> 
>   /* Back to normal. */
>   current->cpus_allowed = cpus_allowed;
>   current->policy = policy;
>   current->rt_priority = rt_priority;
> }
> 

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [PATCH for 2.5] preemptible kernel

2001-03-20 Thread Nigel Gamble

On Tue, 20 Mar 2001, Roger Larsson wrote:
> One little readability thing I found.
> The prev->state TASK_ value is mostly used as a plain value
> but the new TASK_PREEMPTED is or:ed together with whatever was there.
> Later when we switch to check the state it is checked against TASK_PREEMPTED
> only. Since TASK_RUNNING is 0 it works OK but...

Yes, you're right.  I had forgotten that TASK_RUNNING is 0 and I think I
was assuming that there could be (rare) cases where a task was preempted
while prev->state was in transition such that no other flags were set.
This is, of course, impossible given that TASK_RUNNING is 0.  So your
change makes the common case more obvious (to me, at least!).

> --- sched.c.nigel   Tue Mar 20 18:52:43 2001
> +++ sched.c.roger   Tue Mar 20 19:03:28 2001
> @@ -553,7 +553,7 @@
>  #endif
> del_from_runqueue(prev);
>  #ifdef CONFIG_PREEMPT
> -   case TASK_PREEMPTED:
> +   case TASK_RUNNING | TASK_PREEMPTED:
>  #endif
> case TASK_RUNNING:
> }
> 
> 
> We could add all/(other common) combinations as cases 
> 
>   switch (prev->state) {
>   case TASK_INTERRUPTIBLE:
>   if (signal_pending(prev)) {
>   prev->state = TASK_RUNNING;
>   break;
>   }
>   default:
> #ifdef CONFIG_PREEMPT
>   if (prev->state & TASK_PREEMPTED)
>   break;
> #endif
>   del_from_runqueue(prev);
> #ifdef CONFIG_PREEMPT
>   case TASK_RUNNING   | TASK_PREEMPTED:
>   case TASK_INTERRUPTIBLE | TASK_PREEMPTED:
>   case TASK_UNINTERRUPTIBLE   | TASK_PREEMPTED:
> #endif
>   case TASK_RUNNING:
>   }
> 
> 
> Then the break in default case could almost be replaced with a BUG()...
> (I have not checked the generated code)

The other cases are not very common, as they only happen if a task is
preempted during the short time that it is running while in the process
of changing state while going to sleep or waking up, so the default case
is probably OK for them; and I'd be happier to leave the default case
for reliability reasons anyway.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Re: [PATCH for 2.5] preemptible kernel

2001-03-19 Thread Nigel Gamble

Hi Pavel,

Thanks for your comments.

On Sat, 17 Mar 2001, Pavel Machek wrote:
> > diff -Nur 2.4.2/arch/i386/kernel/traps.c linux/arch/i386/kernel/traps.c
> > --- 2.4.2/arch/i386/kernel/traps.c  Wed Mar 14 12:16:46 2001
> > +++ linux/arch/i386/kernel/traps.c  Wed Mar 14 12:22:45 2001
> > @@ -973,7 +973,7 @@
> > set_trap_gate(11,&segment_not_present);
> > set_trap_gate(12,&stack_segment);
> > set_trap_gate(13,&general_protection);
> > -   set_trap_gate(14,&page_fault);
> > +   set_intr_gate(14,&page_fault);
> > set_trap_gate(15,&spurious_interrupt_bug);
> > set_trap_gate(16,&coprocessor_error);
> > set_trap_gate(17,&alignment_check);
> 
> Are you sure about this piece? At least add a comment, because it
> *looks* strange.

With a preemptible kernel, we need to enter the page fault handler with
interrupts disabled to protect the cr2 register.  The interrupt state is
restored immediately after cr2 has been saved.  Otherwise, an interrupt
could cause the faulting thread to be preempted, and the new thread
could also fault, clobbering the cr2 register for the preempted thread.
See the diff for linux/arch/i386/mm/fault.c.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




Locking question (was: [CHECKER] 9 potential copy_*_user bugs in2.4.1)

2001-03-16 Thread Nigel Gamble

On Thu, 15 Mar 2001, Dawson Engler wrote:
>   2.  And, unrelated:  given the current locking discipline, is
>   it bad to hold any type of lock (not just a spin lock) when you
>   call a potentially blocking function?  (It at least seems bad
>   for performance, since you'll hold the lock for milliseconds.)

In general, yes.  The lock may be held for much longer than milliseconds
if the potentially blocking function is waiting for I/O from a network,
or a terminal, potentially causing all threads to block on the lock
until someone presses a key, in this extreme example.  If the lock is a
spinlock, then complete deadlock can occur.

You're probably aware that semaphores are used both as blocking mutex
locks, where the down (lock) and up (unlock) calls are made by the same
thread to protect critical data, and as a synchronization mechanism,
where the down and up calls are made by different threads.   The former
use is a "lock", while the latter down() use is a "potentially blocking
function" in terms of your question.  I don't know how easy it would be
for your analysis tools to distinguish between them.
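The distinction can be illustrated with POSIX semaphores in user space (a stand-in for the kernel's struct semaphore; same primitive, two idioms):

```c
#include <assert.h>
#include <semaphore.h>

/* One primitive, two uses: a semaphore initialised to 1 acts as a
 * mutex (down and up by the same thread); one initialised to 0 acts
 * as an event (one thread ups, another downs and may block). */
static sem_t mutex, event;

static void init_sems(void)
{
    sem_init(&mutex, 0, 1);   /* lock: starts available    */
    sem_init(&event, 0, 0);   /* event: starts unsignalled */
}

static void with_lock(void)
{
    sem_wait(&mutex);         /* down: acquire */
    /* ... critical section ... */
    sem_post(&mutex);         /* up: release, same thread */
}

static void signal_event(void) { sem_post(&event); }  /* thread A */
static void await_event(void)  { sem_wait(&event); }  /* thread B: this
                                 down is the "potentially blocking
                                 function" from the question */
```

In the demo below the event is posted before it is awaited, so the single-threaded example cannot block; in real code the post and wait come from different threads.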

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

MontaVista Software [EMAIL PROTECTED]




[PATCH for 2.5] preemptible kernel

2001-03-14 Thread Nigel Gamble

Here is the latest preemptible kernel patch.  It's much cleaner and
smaller than previous versions, so I've appended it to this mail.  This
patch is against 2.4.2, although it's not intended for 2.4.  I'd like
comments from anyone interested in a low-latency Linux kernel solution
for the 2.5 development tree.

Kernel preemption is not allowed while spinlocks are held, which means
that this patch alone cannot guarantee low preemption latencies.  But
as long held locks (in particular the BKL) are replaced by finer-grained
locks, this patch will enable lower latencies as the kernel also becomes
more scalable on large SMP systems.
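That rule can be modelled in a few lines of userland C (the counter mirrors the patch's per-task preempt_count, but everything here is an illustrative sketch, not the patch itself):

```c
#include <assert.h>

/* Model: a spinlock acquisition disables preemption; releasing the
 * outermost lock re-enables it and honours any deferred reschedule. */
static int preempt_count;   /* > 0 means preemption is forbidden  */
static int need_resched;    /* a higher-priority task woke up     */
static int preemptions;     /* deferred preemptions finally taken */

static void preempt_disable(void) { preempt_count++; }

static void preempt_enable(void)
{
    if (--preempt_count == 0 && need_resched) {
        need_resched = 0;
        preemptions++;      /* stand-in for calling the scheduler */
    }
}

static void spin_lock_model(void)   { preempt_disable(); /* + raw lock  */ }
static void spin_unlock_model(void) { /* raw unlock + */ preempt_enable(); }
```

This is why lock hold times bound the achievable latency: a wakeup that arrives while any lock is held is deferred until the outermost unlock.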

Notwithstanding the comments in the Configure.help section for
CONFIG_PREEMPT, I think this patch has a negligible effect on
throughput.  In fact, I got better average results from running 'dbench
16' on a 750MHz PIII with 128MB with kernel preemption turned on
(~30MB/s) than on the plain 2.4.2 kernel (~26MB/s).

(I had to rearrange three header files that are needed in sched.h before
task_struct is defined, but which include inline functions that cannot
now be compiled until after task_struct is defined.  I chose not to
move them into sched.h, like d_path(), as I don't want to make it more
difficult to apply kernel patches to my kernel source tree.)

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/


diff -Nur 2.4.2/CREDITS linux/CREDITS
--- 2.4.2/CREDITS   Wed Mar 14 12:15:49 2001
+++ linux/CREDITS   Wed Mar 14 12:21:42 2001
@@ -907,8 +907,8 @@
 
 N: Nigel Gamble
 E: [EMAIL PROTECTED]
-E: [EMAIL PROTECTED]
 D: Interrupt-driven printer driver
+D: Preemptible kernel
 S: 120 Alley Way
 S: Mountain View, California 94040
 S: USA
diff -Nur 2.4.2/Documentation/Configure.help linux/Documentation/Configure.help
--- 2.4.2/Documentation/Configure.help  Wed Mar 14 12:16:10 2001
+++ linux/Documentation/Configure.help  Wed Mar 14 12:22:04 2001
@@ -130,6 +130,23 @@
   If you have system with several CPU's, you do not need to say Y
   here: APIC will be used automatically.
 
+Preemptible Kernel
+CONFIG_PREEMPT
+  This option reduces the latency of the kernel when reacting to
+  real-time or interactive events by allowing a low priority process to
+  be preempted even if it is in kernel mode executing a system call.
+  This allows applications that need real-time response, such as audio
+  and other multimedia applications, to run more reliably even when the
+  system is under load due to other, lower priority, processes.
+
+  This option is currently experimental if used in conjuction with SMP
+  support.
+
+  Say Y here if you are building a kernel for a desktop system, embedded
+  system or real-time system.  Say N if you are building a kernel for a
+  system where throughput is more important than interactive response,
+  such as a server system.  Say N if you are unsure.
+
 Kernel math emulation
 CONFIG_MATH_EMULATION
   Linux can emulate a math coprocessor (used for floating point
diff -Nur 2.4.2/arch/i386/config.in linux/arch/i386/config.in
--- 2.4.2/arch/i386/config.in   Wed Mar 14 12:14:18 2001
+++ linux/arch/i386/config.in   Wed Mar 14 12:20:02 2001
@@ -161,6 +161,11 @@
   define_bool CONFIG_X86_IO_APIC y
   define_bool CONFIG_X86_LOCAL_APIC y
fi
+   bool 'Preemptible Kernel' CONFIG_PREEMPT
+else
+   if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then
+  bool 'Preemptible SMP Kernel (EXPERIMENTAL)' CONFIG_PREEMPT
+   fi
 fi
 
 if [ "$CONFIG_SMP" = "y" -a "$CONFIG_X86_CMPXCHG" = "y" ]; then
diff -Nur 2.4.2/arch/i386/kernel/entry.S linux/arch/i386/kernel/entry.S
--- 2.4.2/arch/i386/kernel/entry.S  Wed Mar 14 12:17:37 2001
+++ linux/arch/i386/kernel/entry.S  Wed Mar 14 12:23:42 2001
@@ -72,7 +72,7 @@
  * these are offsets into the task-struct.
  */
 state  =  0
-flags  =  4
+preempt_count  =  4
 sigpending =  8
 addr_limit = 12
 exec_domain= 16
@@ -80,8 +80,30 @@
 tsk_ptrace = 24
 processor  = 52
 
+/* These are offsets into the irq_stat structure
+ * There is one per cpu and it is aligned to 32
+ * byte boundry (we put that here as a shift count)
+ */
+irq_array_shift = CONFIG_X86_L1_CACHE_SHIFT
+
+irq_stat_softirq_active = 0
+irq_stat_softirq_mask   = 4
+irq_stat_local_irq_count= 8
+irq_stat_local_bh_count = 12
+
 ENOSYS = 38
 
+#ifdef CONFIG_SMP
+#define GET_CPU_INDX   movl processor(%ebx),%eax;  \
+shll $irq_array_shift,%eax
+#define GET_CURRENT_CPU_INDX GET_CURRENT(%ebx); \
+ GET_CPU_INDX
+#define CPU_INDX (,%eax)
+#else
+#define GET_CPU_INDX
+#define GET_CURRENT_CPU_INDX GET_CURRENT(%ebx)
+#define CPU_INDX
+#endif
 
 #define SAVE_ALL \
cld; \
@@ -270,16 +292,44 @@
 #endif
jne   handle_softirq
 
+#ifdef CONFIG_PREEMPT

Re: spinlock help

2001-03-06 Thread Nigel Gamble

On Tue, 6 Mar 2001, Manoj Sontakke wrote:
> 1. when the spin_lock_irqsave() function is called, the subsequent code is
> executed until spin_unlock_irqrestore() is called. is this right?

Yes.  The protected code will not be interrupted, or simultaneously
executed by another CPU.

> 2. is this sequence valid?
>   spin_lock_irqsave(a,b);
>   spin_lock_irqsave(c,d);

Yes, as long as it is followed by:

spin_unlock_irqrestore(c, d);
spin_unlock_irqrestore(a, b);
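
A userland toy shows why that LIFO order matters: each unlock must restore the interrupt state saved by its matching lock (names are illustrative, not the kernel API):

```c
#include <assert.h>

/* Model of the irqsave/irqrestore pairing: the first lock saves
 * "interrupts enabled", the nested lock saves "interrupts disabled";
 * restoring in LIFO order leaves interrupts enabled again at the end. */
static int irq_enabled = 1;

static unsigned long lock_irqsave(void)
{
    unsigned long flags = (unsigned long)irq_enabled;  /* save state */
    irq_enabled = 0;                                   /* irqs off   */
    return flags;
}

static void unlock_irqrestore(unsigned long flags)
{
    irq_enabled = (int)flags;                          /* restore */
}
```

Restoring in the wrong order would re-enable interrupts while the outer lock was still held.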

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/




Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0

2001-01-21 Thread Nigel Gamble

On Sun, 21 Jan 2001, Paul Barton-Davis wrote:
> >Let me just point out that Victor has his own commercial axe to grind in
> >his continual bad-mouthing of IRIX, the internals of which he knows
> >nothing about.
> 
> 1) do you actually disagree with victor ?

Yes, I most emphatically do disagree with Victor!  IRIX is used for
mission-critical audio applications - recording as well playback - and
other low-latency applications.  The same OS scales to large numbers of
CPUs.  And it has the best desktop interactive response of any OS I've
used.  I will be very happy when Linux is as good in all these areas,
and I'm working hard to achieve this goal with negligible impact on the
current Linux "sweet-spot" applications such as web serving.

> this discussion has the hallmarks of turning into a personal
> bash-fest, which is really pointless. what is *not* pointless is a
> considered discussion about the merits of the IRIX "RT" approach over
> possible approaches that Linux might take which are dissimilar to the
> IRIX one. on the other hand, as Victor said, a large part of that
> discussion ultimately comes down to a design style rather than hard
> factual or logical reasoning.

I agree.  I'm not wedded to any particular design - I just want a
low-latency Linux by whatever is the best way of achieving that.
However, I am hearing Victor say that we shouldn't try to make Linux
itself low-latency, we should just use his so-called "RTLinux" environment
for low-latency tasks.  RTLinux is not Linux, it is a separate
environment with a separate, limited set of APIs.  You can't run XMMS,
or any other existing Linux audio app in RTLinux.  I want a low-latency
Linux, not just another RTOS living parasitically alongside Linux.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0

2001-01-21 Thread Nigel Gamble

On Sat, 20 Jan 2001 [EMAIL PROTECTED] wrote:
> Let me just point out that Nigel (I think) has previously stated that
> the purpose of this approach is to bring the stunning success of 
> IRIX style "RT" to Linux. Since some of us believe that IRIX is a virtual
> handbook of OS errors, it really comes down to a design style. I think
> that simplicity and "does the main job well" wins every time over 
> "really cool algorithms" and "does everything badly". Others 
> disagree.

Let me just point out that Victor has his own commercial axe to grind in
his continual bad-mouthing of IRIX, the internals of which he knows
nothing about.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Latency: allowing resheduling while holding spin_locks

2001-01-13 Thread Nigel Gamble

On Sat, 13 Jan 2001, Roger Larsson wrote:
> A rethinking of the rescheduling strategy...

Actually, I think you have more-or-less described how successful
preemptible kernels have already been developed, given that your
"sleeping spin locks" are really just sleeping mutexes (or binary
semaphores).

1.  Short critical regions are protected by spin_lock_irq().  The maximum
value of "short" is therefore bounded by the maximum time we are happy
to disable (local) interrupts - ideally ~100us.

2.  Longer regions are protected by sleeping mutexes.

3.  Algorithms are rearchitected until all of the highly contended locks
are of type 1, and only low contention locks are of type 2.

This approach has the advantage that we don't need to use a no-preempt
count, and test it on exit from every spinlock to see if a preempting
interrupt that has caused a need_resched has occurred, since we won't
see the interrupt until it's safe to do the preemptive resched.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0

2001-01-12 Thread Nigel Gamble

On Sat, 13 Jan 2001, Andrew Morton wrote:
> Nigel Gamble wrote:
> > Spinlocks should not be held for lots of time.  This adversely affects
> > SMP scalability as well as latency.  That's why MontaVista's kernel
> > preemption patch uses sleeping mutex locks instead of spinlocks for the
> > long held locks.
> 
> Nigel,
> 
> what worries me about this is the Apache-flock-serialisation saga.
> 
> Back in -test8, kumon@fujitsu demonstrated that changing this:
> 
>   lock_kernel()
>   down(sem)
>   stuff
>   up(sem)
>   unlock_kernel()
> 
> into this:
> 
>   down(sem)
>   stuff
>   up(sem)
> 
> had the effect of *decreasing* Apache's maximum connection rate
> on an 8-way from ~5,000 connections/sec to ~2,000 conn/sec.
> 
> That's downright scary.
> 
> Obviously, stuff was very quick, and the CPUs were passing through
> this section at a great rate.

Yes, this demonstrates that spinlocks are preferable to sleep locks for
short sections.  However, the implementation of up() may be partly to
blame: it appears to prefer to context switch to the woken-up process
instead of continuing to run the current process.  Surrounding the
semaphore with the BKL has the effect of enforcing the latter behavior,
because the semaphore itself will never have any waiters.

> How can we be sure that converting spinlocks to semaphores
> won't do the same thing?  Perhaps for workloads which we
> aren't testing?
> 
> So this needs to be done with caution.
> 
> As davem points out, now we know where the problems are
> occurring, a good next step is to redesign some of those
> parts of the VM and buffercache.  I don't think this will
> be too hard, but they have to *want* to change :)

Yes, wherever the code can be redesigned to avoid long held locks, that
would definitely be my preferred solution.  I think everyone would be
happy if we could end up with a maintainable solution using only
spinlocks that are held for no longer than a couple of hundred
microseconds.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0

2001-01-12 Thread Nigel Gamble

On Fri, 12 Jan 2001, Tim Wright wrote:

> On Sat, Jan 13, 2001 at 12:30:46AM +1100, Andrew Morton wrote:
> > what worries me about this is the Apache-flock-serialisation saga.
> > 
> > Back in -test8, kumon@fujitsu demonstrated that changing this:
> > 
> > lock_kernel()
> > down(sem)
> > stuff
> > up(sem)
> > unlock_kernel()
> > 
> > into this:
> > 
> > down(sem)
> > stuff
> > up(sem)
> > 
> > had the effect of *decreasing* Apache's maximum connection rate
> > on an 8-way from ~5,000 connections/sec to ~2,000 conn/sec.
> > 
> > That's downright scary.
> > 
> > Obviously, stuff was very quick, and the CPUs were passing through
> > this section at a great rate.
> > 
> > How can we be sure that converting spinlocks to semaphores
> > won't do the same thing?  Perhaps for workloads which we
> > aren't testing?
> > 
> > So this needs to be done with caution.
> > 
> 
> Hmmm...
> if stuff is very quick, and is guaranteed not to sleep, then a semaphore
> is the wrong way to protect it. A spinlock is the correct choice. If it's
> always slow, and can sleep, then a semaphore makes more sense, although if
> it's highly contended, you're going to serialize and throughput will die.
> At that point, you need to redesign :-)
> If it's mostly quick but occasionally needs to sleep, I don't know what the
> correct idiom would be in Linux. DYNIX/ptx has the concept of atomically
> releasing a spinlock and going to sleep on a semaphore, and that would be
> the solution there e.g.
> 
> p_lock(lock);
> retry:
> ...
> if (condition where we need to sleep) {
> p_sema_v_lock(sema, lock);
> /* we got woken up */
> p_lock(lock);
> goto retry;
> }
> ...
> 
> I'm stating the obvious here, and re-iterating what you said, and that is that
> we need to carefully pick the correct primitive for the job. Unless there's
> something very unusual in the Linux implementation that I've missed, a
> spinlock is a "cheaper" method of protecting a short critical section, and
> should be chosen.
> 
> I know the BKL is a semantically a little unusual (the automatic release on
> sleep stuff), but even so, isn't
> 
>   lock_kernel()
>   down(sem)
>   stuff
>   up(sem)
>   unlock_kernel()
> 
> actually equivalent to
> 
>   lock_kernel()
>   stuff
>   unlock_kernel()
> 
> If so, it's no great surprise that performance dropped given that we replaced
> a spinlock (albeit one guarding somewhat more than the critical section) with
> a semaphore.
> 
> Tim
> 
> --
> Tim Wright - [EMAIL PROTECTED] or [EMAIL PROTECTED] or [EMAIL PROTECTED]
> IBM Linux Technology Center, Beaverton, Oregon
> "Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
> 
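
The DYNIX/ptx `p_sema_v_lock()` idiom quoted above - atomically release a lock and go to sleep, then retake the lock on wakeup - is exactly what POSIX condition variables provide: `pthread_cond_wait()` drops the mutex and blocks in one atomic step, and reacquires the mutex before returning. A minimal userspace sketch (all names invented; the kernel grew its own equivalent primitives later):

```c
/* POSIX condition-variable version of the p_lock/p_sema_v_lock/retry
 * loop shown above; the while loop replaces the goto retry. */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int condition_ready;

void consumer(void)
{
    pthread_mutex_lock(&lock);           /* p_lock(lock) */
    while (!condition_ready)             /* retry loop, as a while */
        pthread_cond_wait(&cond, &lock); /* atomically unlock + sleep;
                                            lock is held again on return */
    condition_ready = 0;                 /* consume the event */
    pthread_mutex_unlock(&lock);
}

void producer(void)
{
    pthread_mutex_lock(&lock);
    condition_ready = 1;
    pthread_cond_signal(&cond);          /* wake one sleeper */
    pthread_mutex_unlock(&lock);
}
```

The atomicity is the whole point: releasing the lock and then sleeping as two separate steps opens a window in which the wakeup can be lost.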

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0

2001-01-11 Thread Nigel Gamble

On Wed, 10 Jan 2001, David S. Miller wrote:
> Opinion: Personally, I think the approach in Andrew's patch
>    is the way to go.
> 
>    Not because it can give the absolute best results.
>    But rather, it is because it says "here is where a lot
>    of time is spent".
> 
>    This has two huge benefits:
>    1) It tells us where possible algorithmic improvements may
>       be possible.  In some cases we may be able to improve the
>       code to the point where the pre-emption points are no
>       longer necessary and can thus be removed.

This is definitely an important goal.  But lock-metering code in a fully
preemptible kernel can also identify spots where algorithmic improvements
are most important.

>    2) It affects only code which can burn a lot of cpu without
>       scheduling.  Compare this to schemes which make the kernel
>       fully pre-emptable, causing _EVERYONE_ to pay the price of
>       low-latency.  If we were to later find algorithmic
>       improvements to the high-latency pieces of code, we
>       couldn't then just "undo" support for pre-emption because
>       dependencies will have swept across the whole kernel
>       already.
> 
>    Pre-emption, by itself, also doesn't help in situations
>    where lots of time is spent while holding spinlocks.
>    There are several other operating systems which support
>    pre-emption where you will find hard coded calls to the
>    scheduler in time-consuming code.  Heh, it's almost like,
>    "what's the frigging point of pre-emption then if you
>    still have to manually check in some spots?"

Spinlocks should not be held for lots of time.  This adversely affects
SMP scalability as well as latency.  That's why MontaVista's kernel
preemption patch uses sleeping mutex locks instead of spinlocks for the
long held locks.  In a fully preemptible kernel that is implemented
correctly, you won't find any hard-coded calls to the scheduler in time
consuming code.  The scheduler should only be called in response to an
interrupt (IO or timeout) when we know that a higher priority process
has been made runnable, or when the running process sleeps (voluntarily
or when it has to wait for something) or exits.  This is the case in
both of the fully preemptible kernels which I've worked on (IRIX and
REAL/IX).

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] 2.4.0-prerelease: preemptive kernel.

2001-01-04 Thread Nigel Gamble

On Thu, 4 Jan 2001, ludovic fernandez wrote:
> This is not the point I was trying to make.
> So far we are talking about real-time behaviour. This is a very
> interesting/exciting thing and we all agree it's a huge task which
> goes well beyond just having a preemptive kernel.

You're right that it is more than just a preemptible kernel, but I don't
agree that it's all that huge.  But this is the third time I have worked
on enabling real-time behavior in unix-like OSes, so I may be biased ;-)

> I'm not convinced that a preemptive kernel is interesting for apps using
> time-sharing scheduling, mainly because it is not deterministic and the
> price of an mmu context switch is still way too heavy (that's my 2 cents'
> belief anyway).

But as Roger pointed out, the number of extra context switches
introduced by having a preemptible kernel is actually very low.  If an
interrupt occurs while running in user mode, the context switch it may
cause will happen even in a non-preemptible kernel.  I think that
running a kernel compile for example, the number of context switches per
second caused by kernel preemption is probably between 1% and 10% of the
total context switches per second.  And it's certainly interesting to me
that I can listen to MP3s without interruption now, while doing a kernel
build!
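
On a modern system this split can be observed directly: `getrusage()` reports voluntary context switches (sleeps) separately from involuntary ones (preemptions). A small sketch with an invented helper name:

```c
/* Report how many times this process has been involuntarily context
 * switched (preempted) so far, per getrusage()'s ru_nivcsw field. */
#include <sys/resource.h>

long preemptions_so_far(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    /* ru_nvcsw would give the voluntary (sleep-driven) switches */
    return ru.ru_nivcsw;
}
```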

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] 2.4.0-prerelease: preemptive kernel.

2001-01-04 Thread Nigel Gamble

On Thu, 4 Jan 2001, Andi Kleen wrote:
> On Thu, Jan 04, 2001 at 01:39:57PM -0800, Nigel Gamble wrote:
> > Experience has shown that adaptive spinlocks are not worth the extra
> > overhead (if you mean the type that spin for a short time
> > and then decide to sleep).  It is better to use spin_lock_irqsave()
> > (which, by definition, disables kernel preemption without the need
> > to set a no-preempt flag) to protect regions where the lock is held
> > for a maximum of around 100us, and to use a sleeping mutex lock for
> > longer regions.  This is what I'm working towards.
> 
> What experience?  Only real-time latency testing or SMP scalability
> testing?

Both.  We spent a lot of time on this when I was at SGI working on IRIX.
I think we ended up with excellent SMP scalability and good real-time
latency.  There is also some academic research that suggests that
the extra overhead of a dynamic adaptive spinlock usually outweighs
any possible gains.

> The case I was thinking about is a heavily contended lock like the
> inode semaphore of a file that is used by several threads on several
> CPUs in parallel or the mm semaphore of an often-faulted shared mm. 
> 
> It's not an option to convert them to a spinlock, but often the delays
> are short enough that a short spin could make sense. 

I think the first order performance problem of a heavily contended lock
is not how it is implemented, but the fact that it is heavily contended.
In IRIX we spent a lot of time looking for these bottlenecks and
re-architecting to avoid them.  (This would mean minimizing the shared
accesses in your examples.)

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] 2.4.0-prerelease: preemptive kernel.

2001-01-04 Thread Nigel Gamble

On Thu, 4 Jan 2001, Andi Kleen wrote:
> The problem is that current Linux semaphores are very costly locks -- they
> always cause a context switch.

My preemptible kernel patch currently just uses Linux semaphores to
implement sleeping kernel mutexes, but we (at MontaVista Software) are
working on a new implementation that also does priority inheritance,
to avoid the priority inversion problem, and that does the minimum
necessary context switches.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] 2.4.0-prerelease: preemptive kernel.

2001-01-04 Thread Nigel Gamble

On Thu, 4 Jan 2001, Andi Kleen wrote:
> On Thu, Jan 04, 2001 at 08:35:02AM +0100, Daniel Phillips wrote:
> > A more ambitious way to proceed is to change spinlocks so they can sleep
> > (not in interrupts of course).  There would not be any extra overhead
> 
> Imagine what happens when a non-sleeping spinlock in an interrupt waits 
> for a "sleeping spinlock" somewhere else...
> I'm not sure if this is a good idea. Sleeping locks everywhere would
> imply scheduled interrupts, which are nasty. 

Yes, you have to make sure that you never call a sleeping lock
while holding a spinlock.  And you can't call a sleeping lock from
interrupt handlers in the current model.  But this is easy to avoid.

> I think a better way to proceed would be to make semaphores a bit more 
> intelligent and turn them into something like adaptive spinlocks and use
> them more where appropriate (currently using semaphores usually causes
> lots of context switches where some could probably be avoided). Problem
> is that for some cases like your producer-consumer pattern (which has been
> used previously in unreleased kernel code BTW) it would be a pessimization
> to spin, so such adaptive locks would probably need a different name.

Experience has shown that adaptive spinlocks are not worth the extra
overhead (if you mean the type that spin for a short time
and then decide to sleep).  It is better to use spin_lock_irqsave()
(which, by definition, disables kernel preemption without the need
to set a no-preempt flag) to protect regions where the lock is held
for a maximum of around 100us, and to use a sleeping mutex lock for
longer regions.  This is what I'm working towards.
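
For reference, the "adaptive" lock being argued against here can be sketched in a few lines: spin on a trylock for a bounded number of attempts, then give up and take the sleeping path. This is an illustration only (the trylock loop, `demo_lock`, and `SPIN_TRIES` are invented); glibc later shipped the same idea as `PTHREAD_MUTEX_ADAPTIVE_NP`.

```c
/* Spin-then-sleep ("adaptive") acquire: cheap if the holder releases
 * quickly, falls back to a blocking acquire otherwise. */
#include <assert.h>
#include <pthread.h>

#define SPIN_TRIES 1000

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

int adaptive_lock(pthread_mutex_t *m)
{
    int i;

    for (i = 0; i < SPIN_TRIES; i++)
        if (pthread_mutex_trylock(m) == 0)
            return 0;            /* got it while spinning */
    return pthread_mutex_lock(m);  /* give up and sleep */
}
```

Nigel's point stands against this sketch too: the spin phase is wasted work whenever the lock is held longer than the spin budget, which is why he prefers classifying locks up front instead of deciding dynamically.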

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] 2.4.0-prerelease: preemptive kernel.

2001-01-04 Thread Nigel Gamble

On Thu, 4 Jan 2001, Daniel Phillips wrote:
> A more ambitious way to proceed is to change spinlocks so they can sleep
> (not in interrupts of course).  There would not be any extra overhead
> for this on spin_lock (because the sleep test is handled off the fast
> path) but spin_unlock gets a little slower - it has to test and jump on
> a flag if there are sleepers.

I already have a preemption patch that also changes the longest
held spinlocks into sleep locks, i.e. the locks that are routinely
held for > 1ms.  This gives a kernel with very good interactive
response, good enough for most audio apps.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] 2.4.0-prerelease: preemptive kernel.

2001-01-04 Thread Nigel Gamble

On Wed, 3 Jan 2001, ludovic fernandez wrote:
> For hackers,
> The following patch makes the kernel preemptable.
> It is against 2.4.0-prerelease and is for i386 only.
> It should work for UP and SMP even though I
> didn't validate it on SMP.
> Comments are welcome.

Hi Ludo,

I didn't realise you were still working on this.  Did you know that
I am also?  Our most recent version is at:

ftp://ftp.mvista.com/pub/Area51/preemptible_kernel/

although I have yet to put up a 2.4.0-prerelease patch (coming soon).
We should probably pool our efforts on this for 2.5.

Cheers,
Nigel

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] 2.4.0-prerelease: preemptive kernel.

2001-01-04 Thread Nigel Gamble

On Wed, 3 Jan 2001, ludovic fernandez wrote:
 For hackers,
 The following patch makes the kernel preemptable.
 It is against 2.4.0-prerelease on for i386 only.
 It should work for UP and SMP even though I
 didn't validate it on SMP.
 Comments are welcome.

Hi Ludo,

I didn't realise you were still working on this.  Did you know that
I am also?  Our most recent version is at:

ftp://ftp.mvista.com/pub/Area51/preemptible_kernel/

although I have yet to put up a 2.4.0-prerelease patch (coming soon).
We should probably pool our efforts on this for 2.5.

Cheers,
Nigel

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] 2.4.0-prerelease: preemptive kernel.

2001-01-04 Thread Nigel Gamble

On Thu, 4 Jan 2001, Daniel Phillips wrote:
 A more ambitious way to proceed is to change spinlocks so they can sleep
 (not in interrupts of course).  There would not be any extra overhead
 for this on spin_lock (because the sleep test is handled off the fast
 path) but spin_unlock gets a little slower - it has to test and jump on
 a flag if there are sleepers.

I already have a preemption patch that also changes the longest
held spinlocks into sleep locks, i.e. the locks that are routinely
held for  1ms.  This gives a kernel with very good interactive
response, good enough for most audio apps.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] 2.4.0-prerelease: preemptive kernel.

2001-01-04 Thread Nigel Gamble

On Thu, 4 Jan 2001, Andi Kleen wrote:
 The problem is that current Linux semaphores are very costly locks -- they
 always cause a context switch.

My preemptible kernel patch currently just uses Linux semaphores to
implement sleeping kernel mutexes, but we (at MontaVista Software) are
working on a new implementation that also does priority inheritance,
to avoid the priority inversion problem, and that does the minimum
necessary context switches.

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] 2.4.0-prerelease: preemptive kernel.

2001-01-04 Thread Nigel Gamble

On Thu, 4 Jan 2001, Andi Kleen wrote:
 On Thu, Jan 04, 2001 at 01:39:57PM -0800, Nigel Gamble wrote:
  Experience has shown that adaptive spinlocks are not worth the extra
  overhead (if you mean the type that spin for a short time
  and then decide to sleep).  It is better to use spin_lock_irqsave()
  (which, by definition, disables kernel preemption without the need
  to set a no-preempt flag) to protect regions where the lock is held
  for a maximum of around 100us, and to use a sleeping mutex lock for
  longer regions.  This is what I'm working towards.
 
 What experience ?  Only real-time latency testing or SMP scalability 
 testing? 

Both.  We spent a lot of time on this when I was at SGI working on IRIX.
I think we ended up with excellent SMP scalability and good real-time
latency.  There is also some academic research that suggests that
the extra overhead of a dynamic adaptive spinlock usually outweighs
any possible gains.

> The case I was thinking about is a heavily contended lock like the
> inode semaphore of a file that is used by several threads on several
> CPUs in parallel, or the mm semaphore of an often-faulted shared mm.
>
> It's not an option to convert them to a spinlock, but often the delays
> are short enough that a short spin could make sense.

I think the first order performance problem of a heavily contended lock
is not how it is implemented, but the fact that it is heavily contended.
In IRIX we spent a lot of time looking for these bottlenecks and
re-architecting to avoid them.  (This would mean minimizing the shared
accesses in your examples.)

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/




Re: [RFC] Semaphores used for daemon wakeup

2000-12-18 Thread Nigel Gamble

On Sun, 17 Dec 2000, Daniel Phillips wrote:
> This patch illustrates an alternative approach to waking and waiting on
> daemons using semaphores instead of direct operations on wait queues.
> The idea of using semaphores to regulate the cycling of a daemon was
> suggested to me by Arjan Vos.  The basic idea is simple: on each cycle
> a daemon down's a semaphore, and is reactivated when some other task
> up's the semaphore.

> Is this better, worse, or lateral?

This is much better, especially from a maintainability point of view.
It is also the method that a lot of operating systems already use.
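
The cycle described in the patch maps directly onto counting semaphores
in userspace as well.  A minimal sketch using POSIX semaphores, with
sem_wait()/sem_post() standing in for the kernel's down()/up() (all
names here are invented for illustration):

```c
#include <semaphore.h>
#include <pthread.h>

/* Daemon wakeup pattern: the daemon down()s a semaphore at the top
 * of each cycle and runs one pass of work each time another task
 * up()s it.  Because wakeups are counted, none are lost. */
static sem_t daemon_sem;
static int cycles_run;

static void *daemon_thread(void *unused)
{
    for (int i = 0; i < 3; i++) {
        sem_wait(&daemon_sem);   /* "down": sleep until kicked */
        cycles_run++;            /* one cycle of daemon work */
    }
    return NULL;
}

static void wake_daemon(void)
{
    sem_post(&daemon_sem);       /* "up": reactivate the daemon */
}
```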

Nigel Gamble[EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/



[PATCH] Latest preemptible kernel (low latency) patch available

2000-11-22 Thread Nigel Gamble

MontaVista Software's latest preemptible kernel patch,
preempt-2.4.0-test11-1.patch.bz2, is now available in
ftp://ftp.mvista.com/pub/Area51/preemptible_kernel/
Here is an extract from the README file:

The patches in this directory, when applied to the corresponding
kernel source, will define a new configure option, 'Preemptable Kernel',
under the 'Processor type and features' section.  When enabled, and the
kernel is rebuilt, it will be fully preemptable, subject to SMP lock
areas (i.e. it uses SMP locking on a UP to control preemptability).

The patch can only be enabled for ix86 uniprocessor platforms.
(Stay tuned for other platforms and SMP support.)

Notes for preempt-2.4.0-test11-1.patch
--------------------------------------

 - Updated to kernel 2.4.0-test11

Notes for preempt-2.4.0-test10-1.patch
--------------------------------------

The main changes between this and previous patches are:

 - Updated to kernel 2.4.0-test10
 - Long held spinlocks changed into mutex locks, currently implemented
   using semaphores.  (We are working on a fast, priority inheriting,
   binary semaphore implementation of these locks.)

The patch gives good results on Benno's Audio-Latency test
http://www.gardena.net/benno/linux/audio/, with maximum
latencies less than a couple of milliseconds recorded
using a 750MHz PIII machine.  However, there are still
some >10ms non-preemptible paths that are not exercised
by this test.

The worst non-preemptible paths are now dominated by the big
kernel lock, which we hope can be completely eliminated in 2.5
by finer-grained locks.

(I will be at the Linux Real-Time Workshop in Orlando next week, and
may not be able to access my work email address ([EMAIL PROTECTED]),
which is why I'm posting this from my personal address.)

Nigel Gamble    [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/




Locking problem in autofs4_expire(), 2.4.0-test10

2000-11-03 Thread Nigel Gamble

dput() is called with dcache_lock already held, resulting in deadlock.

Here is a suggested fix:

===== expire.c 1.3 vs edited =====
--- 1.3/linux/fs/autofs4/expire.c   Tue Oct 31 15:14:06 2000
+++ edited/expire.c Fri Nov  3 17:47:47 2000
@@ -223,8 +223,10 @@
 			mntput(p);
 			return dentry;
 		}
+		spin_unlock(&dcache_lock);
 		dput(d);
 		mntput(p);
+		spin_lock(&dcache_lock);
 	}
 	spin_unlock(&dcache_lock);
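
The underlying rule can be sketched as a userspace analogue, with a
non-recursive pthread mutex standing in for dcache_lock and put_entry()
for dput() (all names here are illustrative, not from the kernel): a
non-recursive lock must be dropped around any call that takes the same
lock itself, and re-acquired afterwards.

```c
#include <pthread.h>

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Like dput(): takes cache_lock internally, so the caller must not
 * already hold it, or the thread deadlocks against itself. */
static void put_entry(void)
{
    pthread_mutex_lock(&cache_lock);
    /* ... drop a reference, maybe free the entry ... */
    pthread_mutex_unlock(&cache_lock);
}

static void expire_pass(int nentries)
{
    pthread_mutex_lock(&cache_lock);
    for (int i = 0; i < nentries; i++) {
        /* ... examine entry i under the lock ... */
        pthread_mutex_unlock(&cache_lock);  /* as in the patch */
        put_entry();
        pthread_mutex_lock(&cache_lock);    /* re-take for next pass */
    }
    pthread_mutex_unlock(&cache_lock);
}
```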

Nigel Gamble
MontaVista Software




Re: Weightless process class

2000-10-04 Thread Nigel Gamble

On Wed, 4 Oct 2000, Rik van Riel wrote:
> On Wed, 4 Oct 2000, LA Walsh wrote:
> 
> > I had another thought regarding resource scheduling -- has the
> > idea of a "weightless" process been brought up? 
> 
> Yes, look for "idle priority", etc..
> It also turned out to have some problems ...
> 
> > Weightless means it doesn't count toward 'load' and the class
> > strictly has lowest priority in the system and gets *no* CPU
> > unless there are "idle" cycles.  So even a process niced to -19
> > could CPU starve a weightless process.
> 
> One problem here is that you might end up with a weightless
> process having grabbed a superblock lock, after which a
> normal priority CPU hog kicks in and starves the weightless
> process.
> 
> The result is that that superblock lock never gets released,
> and everybody needing to grab that lock blocks forever, even
> if they have a higher priority than the CPU hog that's starving
> our idle process...
> 
> The solution to this would be only starve these processes
> when they are in user space and can no longer be holding
> any kernel locks.

The general solution, which SGI implements in IRIX, is priority
inheritance for blocking locks, so that the weightless process
gets the priority of the blocked process until it releases the lock.
IRIX multi-reader semaphores initially did not implement priority
inheritance, until this type of starvation scenario occurred!

I'm working on making the Linux kernel fully preemptible (as I
did for IRIX when I used to work at SGI), and will need
priority-inheritance mutexes to enable real-time behavior for
SCHED_FIFO and SCHED_RR tasks.  So someone at MontaVista will
be looking at this in the 2.5 timeframe.

Nigel Gamble
[EMAIL PROTECTED]
www.mvista.com




Re: Interrupt/Sleep deadlock

2000-09-28 Thread Nigel Gamble

You could use a semaphore for this.  Initialize it to 0, then call
down() from the ioctl, and up() from the interrupt handler.  If the
up() happens before the down(), the down() won't go to sleep.
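
The property being relied on can be sketched in userspace with POSIX
semaphores (sem_post()/sem_wait() standing in for the kernel's
up()/down(); all names here are illustrative): because the semaphore
counts, a wakeup that arrives before the wait is not lost.

```c
#include <semaphore.h>

static sem_t irq_done;

/* Initialize to 0, the analogue of a semaphore declared locked. */
static void irq_sem_init(void)
{
    sem_init(&irq_done, 0, 0);
}

/* Stand-in for the interrupt handler: up() the semaphore.  This may
 * run before anyone is waiting; the count remembers it. */
static void fake_interrupt_handler(void)
{
    sem_post(&irq_done);
}

/* Stand-in for the ioctl path: down() the semaphore.  If the post
 * already happened, this returns immediately instead of sleeping. */
static void fake_ioctl_wait(void)
{
    sem_wait(&irq_done);
}
```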

Nigel
