Re: Linux 2.4.1-ac15

2001-02-23 Thread Andrea Arcangeli

This new one should be better. Using the spinlock for the SMP serialization is
ok because that's an extremely slow path.

diff -urN -X /home/andrea/bin/dontdiff 2.4.2/arch/i386/mm/fault.c 
2.4.2aa/arch/i386/mm/fault.c
--- 2.4.2/arch/i386/mm/fault.c  Thu Feb 22 03:44:53 2001
+++ 2.4.2aa/arch/i386/mm/fault.c    Sat Feb 24 05:41:11 2001
@@ -326,23 +326,27 @@
int offset = __pgd_offset(address);
pgd_t *pgd, *pgd_k;
pmd_t *pmd, *pmd_k;
+   static spinlock_t lazy_vmalloc_lock = SPIN_LOCK_UNLOCKED;
+   unsigned long flags;
 
pgd = tsk->active_mm->pgd + offset;
pgd_k = init_mm.pgd + offset;
 
-   if (!pgd_present(*pgd)) {
-   if (!pgd_present(*pgd_k))
-   goto bad_area_nosemaphore;
+   spin_lock_irqsave(&lazy_vmalloc_lock, flags);
+   if (!pgd_present(*pgd) && pgd_present(*pgd_k)) {
set_pgd(pgd, *pgd_k);
+   spin_unlock_irqrestore(&lazy_vmalloc_lock, flags);
return;
}
-
pmd = pmd_offset(pgd, address);
pmd_k = pmd_offset(pgd_k, address);
 
-   if (pmd_present(*pmd) || !pmd_present(*pmd_k))
-   goto bad_area_nosemaphore;
-   set_pmd(pmd, *pmd_k);
-   return;
+   if (!pmd_present(*pmd) && pmd_present(*pmd_k)) {
+   set_pmd(pmd, *pmd_k);
+   spin_unlock_irqrestore(&lazy_vmalloc_lock, flags);
+   return;
+   }
+   spin_unlock_irqrestore(&lazy_vmalloc_lock, flags);
+   goto bad_area_nosemaphore;
}
 }
diff -urN -X /home/andrea/bin/dontdiff 2.4.2/arch/i386/mm/init.c 
2.4.2aa/arch/i386/mm/init.c
--- 2.4.2/arch/i386/mm/init.c   Sat Feb 10 02:34:03 2001
+++ 2.4.2aa/arch/i386/mm/init.c Fri Feb 23 19:09:25 2001
@@ -116,21 +116,13 @@
pte_t *pte;
 
pte = (pte_t *) __get_free_page(GFP_KERNEL);
-   if (pmd_none(*pmd)) {
-   if (pte) {
-   clear_page(pte);
-   set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte)));
-   return pte + offset;
-   }
-   set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(get_bad_pte_table())));
-   return NULL;
+   if (pte) {
+   clear_page(pte);
+   set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte)));
+   return pte + offset;
}
-   free_page((unsigned long)pte);
-   if (pmd_bad(*pmd)) {
-   __handle_bad_pmd_kernel(pmd);
-   return NULL;
-   }
-   return (pte_t *) pmd_page(*pmd) + offset;
+   set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(get_bad_pte_table())));
+   return NULL;
 }
 
 pte_t *get_pte_slow(pmd_t *pmd, unsigned long offset)
diff -urN -X /home/andrea/bin/dontdiff 2.4.2/include/asm-i386/pgalloc-3level.h 
2.4.2aa/include/asm-i386/pgalloc-3level.h
--- 2.4.2/include/asm-i386/pgalloc-3level.h Fri Dec  3 20:12:23 1999
+++ 2.4.2aa/include/asm-i386/pgalloc-3level.h   Fri Feb 23 19:03:14 2001
@@ -53,12 +53,9 @@
if (!page)
page = get_pmd_slow();
if (page) {
-   if (pgd_none(*pgd)) {
-   set_pgd(pgd, __pgd(1 + __pa(page)));
-   __flush_tlb();
-   return page + address;
-   } else
-   free_pmd_fast(page);
+   set_pgd(pgd, __pgd(1 + __pa(page)));
+   __flush_tlb();
+   return page + address;
} else
return NULL;
}


Non-PAE mode is still untested, but it should now have a chance to work.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-23 Thread Andrea Arcangeli

On Fri, Feb 23, 2001 at 01:09:02PM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 23 Feb 2001, Andrea Arcangeli wrote:
> > 
> > I think that can't happen. In fact I think the whole section:
> > 
> > pmd = pmd_offset(pgd, address);
> > pmd_k = pmd_offset(pgd_k, address);
> > 
> > if (pmd_present(*pmd) || !pmd_present(*pmd_k))
> > goto bad_area_nosemaphore;
> > set_pmd(pmd, *pmd_k);
> > return;
> > 
> > is superfluous.
> 
> No. Think about the differences in PAE and non-PAE mode. 
> 
> > The middle pagetable is shared and the pmd_t entry is set by vmalloc itself (it
> > just points to the as well shared pte), it's only the pgd that is setup lazily
> 
> In non-PAE mode, the pgd entry doesn't exist. pgd_present() returns an
> unconditional 1. It's the pmd functions that kick in then.

Whoops, I see, I was thinking only of PAE mode indeed 8), sorry. Thanks for the
correction.

All the rest still applies (and the patch still looks fine for PAE mode). I
think I only need to rediff my patch after resurrecting the pmd thing inside
the cli critical section in a #ifndef CONFIG_X86_PAE region.

Andrea



Re: Linux 2.4.1-ac15

2001-02-23 Thread Linus Torvalds



On Fri, 23 Feb 2001, Andrea Arcangeli wrote:
> 
> I think that can't happen. In fact I think the whole section:
> 
>   pmd = pmd_offset(pgd, address);
>   pmd_k = pmd_offset(pgd_k, address);
> 
>   if (pmd_present(*pmd) || !pmd_present(*pmd_k))
>   goto bad_area_nosemaphore;
>   set_pmd(pmd, *pmd_k);
>   return;
> 
> is superfluous.

No. Think about the differences in PAE and non-PAE mode. 

> The middle pagetable is shared and the pmd_t entry is set by vmalloc itself (it
> just points to the as well shared pte), it's only the pgd that is setup lazily

In non-PAE mode, the pgd entry doesn't exist. pgd_present() returns an
unconditional 1. It's the pmd functions that kick in then.

Linus




Re: Linux 2.4.1-ac15

2001-02-23 Thread Andrea Arcangeli

On Thu, Feb 22, 2001 at 10:29:58AM +0000, Alan Cox wrote:
> > >We can take page faults in interrupt handlers in 2.4 so I had to use a 
> > >spinlock, but that sounds the same
> > 
> > Umm? The above doesn't really make sense.
> > 
> > We can take a page fault on the kernel region with the lazy page
> > directory filling, but that code will just set the PGD entry and exit
> > without taking any lock at all. So it basically ends up being an
> > "invisible" event.
> 
> Its only normally invisible. Mark Hemment pointed out there is currently a
> race where if both cpus go to fill in the same entry the logic goes
> 
>   CPU1            CPU2
> 
>   pgd present     pgd present
>                   pmd not present
>                   load pmd
>   pmd present
>   Explode messily
> 

I think that can't happen. In fact I think the whole section:

pmd = pmd_offset(pgd, address);
pmd_k = pmd_offset(pgd_k, address);

if (pmd_present(*pmd) || !pmd_present(*pmd_k))
goto bad_area_nosemaphore;
set_pmd(pmd, *pmd_k);
return;

is superfluous.

The middle pagetable is shared and the pmd_t entry is set by vmalloc itself (it
just points to the likewise shared pte); it's only the pgd that is set up
lazily to simplify the locking (the locking is simplified by an order of
magnitude because, for example, in set_64bit we don't need a locked bus cycle:
we just need to do the 64-bit move in a single instruction, so we can't be
interrupted in the middle by an irq, and all sorts of similar issues now
become serialized inside the local CPU).

So in short unless I'm misreading something I think all the vmalloc faults will
be handled by the:

if (!pgd_present(*pgd)) {
if (!pgd_present(*pgd_k))
goto bad_area_nosemaphore;
set_pgd(pgd, *pgd_k);
return;
}

Said that there are a couple of other issues with the vmalloc pgd lazy handling
but nothing specific to SMP or threads.

1) we can triple fault (page_fault -> irq -> vmalloc fault -> irq -> vmalloc fault)
   and I'm not sure if the cpu will reset after that, because the three faults
   are interleaved with two irq exceptions; it's not easy to find the answer
   empirically, and the documentation apparently doesn't specify it. If it
   resets (it shouldn't if the cpu was well designed) then the whole vmalloc
   lazy design will be hardly fixable.

2) we could race with irqs locally to the CPU this way during the vmalloc handler:

irq handler
vmalloc_fault

nested irq_handler
vmalloc fault
pgd is not present
set_pgd
iret

pgd is present
enter the pmd business and find the pmd is just set
goto error

   and I suspect this is actually the bug triggered by Mark.

3) off-topic from the vmalloc thing, but I also noticed that in some places
   during the pgd/pmd/pte allocations we re-check that nobody mapped in the
   pagetable from under us. But in 2.4 we don't hold the big lock, so sleeping
   or not during allocations is meaningless w.r.t. serialization; we rely only
   on the semaphore and on the fact that other tasks can't play with our
   pagetables (one of the reasons dropping set_pgdir simplifies the locking).

So I did this patch that should be the correct cure for the vmalloc pgd irq
race (2), and that drops a few double checks that seem unnecessary (3);
hopefully (1) is not an issue and the cpu is smart enough to understand it
must not hard reset:

diff -urN -X /home/andrea/bin/dontdiff 2.4.2/arch/i386/mm/fault.c 
2.4.2aa/arch/i386/mm/fault.c
--- 2.4.2/arch/i386/mm/fault.c  Thu Feb 22 03:44:53 2001
+++ 2.4.2aa/arch/i386/mm/fault.c    Fri Feb 23 21:15:53 2001
@@ -326,23 +326,19 @@
int offset = __pgd_offset(address);
pgd_t *pgd, *pgd_k;
pmd_t *pmd, *pmd_k;
+   unsigned long flags;
 
pgd = tsk->active_mm->pgd + offset;
pgd_k = init_mm.pgd + offset;
 
-   if (!pgd_present(*pgd)) {
-   if (!pgd_present(*pgd_k))
-   goto bad_area_nosemaphore;
+   __save_flags(flags);
+   __cli();
+   if (!pgd_present(*pgd) && pgd_present(*pgd_k)) {
set_pgd(pgd, *pgd_k);
+   __restore_flags(flags);
return;
}
-
-   pmd = pmd_offset(pgd, address);
-   pmd_k = pmd_offset(pgd_k, address);
-
-   if (pmd_present(*pmd) || !pmd_present(*pmd_k))
-   goto bad_area_nosemaphore;
-   set_pmd(pmd, *pmd_k);
-   return;
+   __restore_flags(flags);
+   goto bad_area_nosemaphore;

Re: Linux 2.4.1-ac15

2001-02-22 Thread Alan Cox

> >We can take page faults in interrupt handlers in 2.4 so I had to use a 
> >spinlock, but that sounds the same
> 
> Umm? The above doesn't really make sense.
> 
> We can take a page fault on the kernel region with the lazy page
> directory filling, but that code will just set the PGD entry and exit
> without taking any lock at all. So it basically ends up being an
> "invisible" event.

It's only normally invisible. Mark Hemment pointed out there is currently a
race where if both cpus go to fill in the same entry the logic goes

CPU1            CPU2

pgd present     pgd present
                pmd not present
                load pmd
pmd present
Explode messily


The race looks right to me since both CPUs can be running from the same
mm.

The obvious fix (removing the 2nd check) of course hangs the WP check. I
have a hack [not Linus-grade] for that now, but we really need to walk as
far as the pte in the racy case to check for a WP fault.

> 2.4.x. In that case you would take the exception table lock, but that is
> true in both 2.2.x and in 2.4.x.

I didn't say it wasn't.

Alan




Re: Linux 2.4.1-ac15

2001-02-22 Thread Alan Cox

> > We can take page faults in interrupt handlers in 2.4 so I had to use a 
> > spinlock, but that sounds the same
> 
> We can?  Woah, please explain.

vmalloc does a lazy load of the tlb. That can lead to the exception table
being walked on an IRQ.




Re: Linux 2.4.1-ac15

2001-02-21 Thread Rusty Russell

In message <[EMAIL PROTECTED]> you write:
> > This is a while back, but I thought the solution Philipp and I came up
> > with was to simply used a rw semaphore for this, which was taken (read
> > only) on page fault if we have to scan the exception table.
> 
> We can take page faults in interrupt handlers in 2.4 so I had to use a 
> spinlock, but that sounds the same

We can?  Woah, please explain.

Rusty.
--
Premature optmztion is rt of all evl. --DK



Re: Linux 2.4.1-ac15

2001-02-21 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Alan Cox  <[EMAIL PROTECTED]> wrote:
>> This is a while back, but I thought the solution Philipp and I came up
>> with was to simply used a rw semaphore for this, which was taken (read
>> only) on page fault if we have to scan the exception table.
>
>We can take page faults in interrupt handlers in 2.4 so I had to use a 
>spinlock, but that sounds the same

Umm? The above doesn't really make sense.

We can take a page fault on the kernel region with the lazy page
directory filling, but that code will just set the PGD entry and exit
without taking any lock at all. So it basically ends up being an
"invisible" event.

Now, if an interrupt handler accesses kernel memory that just isn't
there, that has _always_ taken a page fault, and that case is not new to
2.4.x. In that case you would take the exception table lock, but that is
true in both 2.2.x and in 2.4.x.

Linus



Re: Linux 2.4.1-ac15

2001-02-21 Thread Alan Cox

> This is a while back, but I thought the solution Philipp and I came up
> with was to simply used a rw semaphore for this, which was taken (read
> only) on page fault if we have to scan the exception table.

We can take page faults in interrupt handlers in 2.4 so I had to use a 
spinlock, but that sounds the same




Re: Linux 2.4.1-ac15

2001-02-20 Thread Rusty Russell

In message <[EMAIL PROTECTED].com> you write:
> > We unlink the module
> > We free the memory
> > 
> > At the same time another cpu may be walking the exception table that we free.
> 
> True.
> 
> Rusty had a patch that locked the module list properly IIRC.

This is a while back, but I thought the solution Philipp and I came up
with was to simply used a rw semaphore for this, which was taken (read
only) on page fault if we have to scan the exception table.

Rusty.
--
Premature optmztion is rt of all evl. --DK



Re: Linux 2.4.1-ac15

2001-02-20 Thread Philipp Rumpf

On Tue, 20 Feb 2001, Keith Owens wrote:
> On Mon, 19 Feb 2001 16:04:07 +0000 (GMT), 
> Alan Cox <[EMAIL PROTECTED]> wrote:
> Using wait_for_at_least_one_schedule_on_every_cpu() has no penalty
> except on the process running rmmod.  It does not require yet more
> spinlocks and is architecture independent.  Since schedule() updates
> sched_data->last_schedule, all the rmmod process has to do is wait
> until the last_schedule value changes on all cpus, forcing a reschedule
> if necessary.
> 
> Zero overhead in schedule, zero overhead in exception handling, zero
> overhead in IA64 unwind code, architecture independent.  Sounds good to
> me.

Not architecture independent unfortunately.  get_cycles() always returns 0
on some SMP-capable machines.

->last_schedule doesn't change if one CPU is always idle (kernel threads
do), or always running the same process (kernel threads do, unless it's an
RT process in an endless loop).  I'm not sure how you'd go about "forcing
a reschedule if necessary".

Using temporary kernel threads has zero overhead in {schedule, exception
handling, IA64 unwind code} and actually works on all architectures.  It
adds overhead to the wait_for_at_least_one_schedule_on_every_cpu() code,
but I think that's acceptable.




Re: Linux 2.4.1-ac15

2001-02-20 Thread Philipp Rumpf

On Tue, 20 Feb 2001, Keith Owens wrote:
 On Mon, 19 Feb 2001 16:04:07 + (GMT), 
 Alan Cox [EMAIL PROTECTED] wrote:
 Using wait_for_at_least_one_schedule_on_every_cpu() has no penalty
 except on the process running rmmod.  It does not require yet more
 spinlocks and is architecture independent.  Since schedule() updates
 sched_data-last_schedule, all the rmmod process has to do is wait
 until the last_schedule value changes on all cpus, forcing a reschedule
 if necessary.
 
 Zero overhead in schedule, zero overhead in exception handling, zero
 overhead in IA64 unwind code, architecture independent.  Sounds good to
 me.

Not architecture independent unfortunately.  get_cycles() always returns 0
on some SMP-capable machines.

-last_schedule doesn't change if one CPU is always idle (kernel threads
do), or always running the same process (kernel threads do, unless it's an
RT process in an endless loop).  I'm not sure how you'd go about "forcing
a reschedule if necessary".

Using temporary kernel threads has zero overhead in {schedule, exception
handling, IA64 unwind code} and actually works on all architectures.  It
adds overhead to the wait_for_at_least_one_schedule_on_every_cpu() code,
but I think that's acceptable.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-20 Thread Rusty Russell

In message [EMAIL PROTECTED] you write:
  We unlink the module
  We free the memory
  
  At the same time another cpu may be walking the exception table that we free.
 
 True.
 
 Rusty had a patch that locked the module list properly IIRC.

This is a while back, but I thought the solution Philipp and I came up
with was to simply use a rw semaphore for this, which was taken (read
only) on page fault if we have to scan the exception table.

Rusty.
--
Premature optmztion is rt of all evl. --DK
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Keith Owens

On Mon, 19 Feb 2001 16:04:07 +0000 (GMT), 
Alan Cox <[EMAIL PROTECTED]> wrote:
>My spinlock based fix has almost no contention and doesn't require 64 processors
>to grind to a halt on a big machine just to handle a module list change. Sorry,
>I don't think it supports your argument.

I am not proposing that the other cpus grind to a halt.  On the
contrary, they are allowed to continue running.  It is the current
process that is suspended until all other cpus have gone through at
least one schedule.

On Mon, 19 Feb 2001 16:23:09 +0100, 
Manfred Spraul <[EMAIL PROTECTED]> wrote:
>what about
>
>   spin_unlock_wait(&global_exception_lock);
>
>The actual exception table walker uses
>   spin_lock_irqsave(&global_exception_lock, flags);
>
>   spin_unlock_irqrestore(&global_exception_lock, flags);
>
>Or a simple spinlock - the code shouldn't be performance critical.

All lock based fixes have the disadvantage that you penalise the entire
kernel all the time.  No matter how small the overhead of getting the
lock, it still exists - so we are slowing down the main kernel all the
time to handle the rare case of module unloading.

Also notice that we keep adding spinlocks.  One for the main module
list, another for the exception tables.  Then there is the architecture
specific module data, including unwind information for IA64; that also
needs to be locked.  Requiring more than one spinlock to handle module
unloading tells me that the design is wrong.

Using wait_for_at_least_one_schedule_on_every_cpu() has no penalty
except on the process running rmmod.  It does not require yet more
spinlocks and is architecture independent.  Since schedule() updates
sched_data->last_schedule, all the rmmod process has to do is wait
until the last_schedule value changes on all cpus, forcing a reschedule
if necessary.

Zero overhead in schedule, zero overhead in exception handling, zero
overhead in IA64 unwind code, architecture independent.  Sounds good to
me.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Keith Owens

On Mon, 19 Feb 2001 14:25:39 +0100, 
Christoph Hellwig <[EMAIL PROTECTED]> wrote:
>You just reinvented the read-copy-update model
>(http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html)...
>
>The mail proposing that locking model for module unloading is not yet
>in the archives, sorry.

I did not reinvent it, I was following up on the discussion we already
had off list about using this model.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Alan Cox

> You can have exceptions while initializing.  not
> MOD_RUNNING|MOD_INITIALIZING should work though.

True.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Mon, 19 Feb 2001, Alan Cox wrote:
> > So you fixed the nonexistent race only.  The real race is that the module
> 
> Umm I fixed the small race. You are right that there is a second race.

There's just one race.  The small race is nonexistent since
put_mod_name always acts as a memory barrier.

> > uninitialized vmalloc'd (module_map'd) memory), then the module data
> > (including the exception table) gets copied.
> > The race window is from the first copy_from_user in sys_init_module until
> > the second one.
> 
> Yep. Obvious answer. Ignore exception tables for modules that are not
> MOD_RUNNING.

You can have exceptions while initializing.  not
MOD_RUNNING|MOD_INITIALIZING should work though.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Alan Cox

> So you fixed the nonexistent race only.  The real race is that the module

Umm I fixed the small race. You are right that there is a second race.

> uninitialized vmalloc'd (module_map'd) memory), then the module data
> (including the exception table) gets copied.
> The race window is from the first copy_from_user in sys_init_module until
> the second one.

Yep. Obvious answer. Ignore exception tables for modules that are not
MOD_RUNNING.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Mon, 19 Feb 2001, Alan Cox wrote:
> > so you hold a spinlock during copy_from_user ?  Or did you change
> > sys_init_module/sys_create_modules semantics ?
> 
> The only points I need to hold a spinlock are editing the chain and
> walking it in a case where I dont hold the kernel lock.
> 
> So I spin_lock_irqsave the two ops updating the list links in each case and
> the exception lookup walk

So you fixed the nonexistent race only.  The real race is that the module
structure gets initialized first (so the exception table pointers point to
uninitialized vmalloc'd (module_map'd) memory), then the module data
(including the exception table) gets copied.

The race window is from the first copy_from_user in sys_init_module until
the second one.

Or am I missing something ?




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Alan Cox

> so you hold a spinlock during copy_from_user ?  Or did you change
> sys_init_module/sys_create_modules semantics ?

The only points I need to hold a spinlock are editing the chain and
walking it in a case where I dont hold the kernel lock.

So I spin_lock_irqsave the two ops updating the list links in each case and
the exception lookup walk

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Mon, 19 Feb 2001, Alan Cox wrote:
> > Rusty had a patch that locked the module list properly IIRC.
> 
> So does -ac now. I just take a spinlock for the modify cases that race
> against faults (including vmalloc faults from irq context)

so you hold a spinlock during copy_from_user ?  Or did you change
sys_init_module/sys_create_modules semantics ?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Alan Cox

> >At the same time another cpu may be walking the exception table that we free.
> 
> Another good reason why locking modules via use counts from external
> code is not the right fix.  We definitely need a quiesce model for
> module removal.

My spinlock based fix has almost no contention and doesn't require 64 processors
to grind to a halt on a big machine just to handle a module list change. Sorry,
I don't think it supports your argument.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Alan Cox

> Rusty had a patch that locked the module list properly IIRC.

So does -ac now. I just take a spinlock for the modify cases that race
against faults (including vmalloc faults from irq context)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Manfred Spraul

Keith Owens wrote:
>  wait_for_at_least_one_schedule_on_every_cpu();

what about

	spin_unlock_wait(&global_exception_lock);

The actual exception table walker uses
	spin_lock_irqsave(&global_exception_lock, flags);

	spin_unlock_irqrestore(&global_exception_lock, flags);

Or a simple spinlock - the code shouldn't be performance critical.
--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Tue, 20 Feb 2001, Keith Owens wrote:
> On Mon, 19 Feb 2001 06:15:22 -0600 (CST), 
> Philipp Rumpf <[EMAIL PROTECTED]> wrote:
> No need for a callin routine, you can get this for free as part of
> normal scheduling.  The sequence goes :-
> 
> if (use_count == 0) {
>   module_unregister();
>   wait_for_at_least_one_schedule_on_every_cpu();
>   if (use_count != 0) {
> module_register();/* lost the unregister race */
>   }
>   else {
> /* nobody can enter the module now */
> module_release_resources();
> unlink_module_from_list();
> wait_for_at_least_one_schedule_on_every_cpu();
> free_module_storage();
>   }
> }
> 
> wait_for_at_least_one_schedule_on_every_cpu() prevents the next

wait_for_at_least_one_schedule_on_every_cpu() *is* callin_other_cpus().
I agree the name isn't optimal.  (and yes, you could implement it by
hacking sched.c directly, but I don't think that's necessary).

> The beauty of this approach is that the rest of the cpus can do normal
> work.  No need to bring everything to a dead stop.

Which nicely avoids potential deadlocks in modules that need to initialize
on all CPUs.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Christoph Hellwig

In article <[EMAIL PROTECTED]> you wrote:
> No need for a callin routine, you can get this for free as part of
> normal scheduling.  The sequence goes :-
>
> if (use_count == 0) {
>   module_unregister();
>   wait_for_at_least_one_schedule_on_every_cpu();
>   if (use_count != 0) {
> module_register();/* lost the unregister race */
>   }
>   else {
> /* nobody can enter the module now */
> module_release_resources();
> unlink_module_from_list();
> wait_for_at_least_one_schedule_on_every_cpu();
> free_module_storage();
>   }
> }
>
> wait_for_at_least_one_schedule_on_every_cpu() prevents the next
> operation until at least one schedule has been executed on every cpu.
> Whether this is done as a call back or a separate kernel thread that
> schedules itself on every cpu or the current process scheduling itself
> on every cpu is an implementation detail.  All that matters is that any
> other cpu that might have been accessing the module has gone through
> schedule and therefore is no longer accessing the module's data or
> code.

You just reinvented the read-copy-update model
(http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html)...

The mail proposing that locking model for module unloading is not yet
in the archives, sorry.

Christoph

P.S. Weren't you Cc:'ed on that mail?
-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Keith Owens

On Mon, 19 Feb 2001 06:15:22 -0600 (CST), 
Philipp Rumpf <[EMAIL PROTECTED]> wrote:
>Unless I'm mistaken, we need both use counts and SMP magic (though not
>necessarily as extreme as what the "freeze all other CPUs during module
>unload" patch did).
>
>I think something like this would work (in addition to use counts)
>
>int callin_func(void *p)
>{
>   int *cpu = p;
>
>   while (*cpu != smp_processor_id()) {
>   current->cpus_allowed = 1 << *cpu;
>   schedule();
>   }
>
>   return 0;
>}
>
>void callin_other_cpus(void)
>{
>   int cpus[smp_num_cpus];
>   int i;
>
>   for (i=0; i<smp_num_cpus; i++) {
>   cpus[i] = i;
>
>   kernel_thread(callin_func, &cpus[i], ...);
>   }
>}
>
>and call callin_other_cpus() before unloading a module.
>
>I'm not sure how you could make exception handling safe without locking
>all accesses to the module list - but that sounds like the sane thing to
>do anyway.

No need for a callin routine, you can get this for free as part of
normal scheduling.  The sequence goes :-

if (use_count == 0) {
  module_unregister();
  wait_for_at_least_one_schedule_on_every_cpu();
  if (use_count != 0) {
module_register();  /* lost the unregister race */
  }
  else {
/* nobody can enter the module now */
module_release_resources();
unlink_module_from_list();
wait_for_at_least_one_schedule_on_every_cpu();
free_module_storage();
  }
}

wait_for_at_least_one_schedule_on_every_cpu() prevents the next
operation until at least one schedule has been executed on every cpu.
Whether this is done as a call back or a separate kernel thread that
schedules itself on every cpu or the current process scheduling itself
on every cpu is an implementation detail.  All that matters is that any
other cpu that might have been accessing the module has gone through
schedule and therefore is no longer accessing the module's data or
code.

The beauty of this approach is that the rest of the cpus can do normal
work.  No need to bring everything to a dead stop.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Mon, 19 Feb 2001, Keith Owens wrote:
> On Mon, 19 Feb 2001 11:35:08 +0000 (GMT), 
> Alan Cox <[EMAIL PROTECTED]> wrote:
> >The problem isnt running module code. What happens in this case
> >
> >mod->next = module_list;
> >module_list = mod;  /* link it in */
> >
> >Note no write barrier.
> 
> <humour>It works on ix86 so the code must be right</humour>

Too bad it doesn't.

> >Delete is even worse
> >
> >We unlink the module
> >We free the memory
> >
> >At the same time another cpu may be walking the exception table that we free.
> 
> Another good reason why locking modules via use counts from external
> code is not the right fix.  We definitely need a quiesce model for
> module removal.

Unless I'm mistaken, we need both use counts and SMP magic (though not
necessarily as extreme as what the "freeze all other CPUs during module
unload" patch did).


I think something like this would work (in addition to use counts)

int callin_func(void *p)
{
int *cpu = p;

while (*cpu != smp_processor_id()) {
current->cpus_allowed = 1 << *cpu;
schedule();
}

return 0;
}

void callin_other_cpus(void)
{
int cpus[smp_num_cpus];
int i;

for (i=0; i<smp_num_cpus; i++) {
	cpus[i] = i;

	kernel_thread(callin_func, &cpus[i], ...);
}
}

and call callin_other_cpus() before unloading a module.

I'm not sure how you could make exception handling safe without locking
all accesses to the module list - but that sounds like the sane thing to
do anyway.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Keith Owens

On Mon, 19 Feb 2001 11:35:08 +0000 (GMT), 
Alan Cox <[EMAIL PROTECTED]> wrote:
>The problem isnt running module code. What happens in this case
>
>mod->next = module_list;
>module_list = mod;  /* link it in */
>
>Note no write barrier.

<humour>It works on ix86 so the code must be right</humour>

>Delete is even worse
>
>We unlink the module
>We free the memory
>
>At the same time another cpu may be walking the exception table that we free.

Another good reason why locking modules via use counts from external
code is not the right fix.  We definitely need a quiesce model for
module removal.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Mon, 19 Feb 2001, Alan Cox wrote:
> mod->next = module_list;

put_mod_name() here.

> module_list = mod;  /* link it in */
> 
> Note no write barrier.

put_mod_name calls free_page which always implies a memory barrier.  This
isn't beautiful but it won't blow up either.

Actually that isn't relevant since the actual table is copied _after_
ex_table_{start,end}.  We'll scan uninitialized memory and if some word
happens to match the fault EIP we jump to the bogus fixup.

> Delete is even worse
> 
> We unlink the module
> We free the memory
> 
> At the same time another cpu may be walking the exception table that we free.

True.

Rusty had a patch that locked the module list properly IIRC.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Alan Cox

> The module list is modified atomically, so either we search the new table
> or we don't, but we never see intermediate states.  Not searching the new
> table shouldn't be a problem as we shouldn't run module code until
> sys_init_module time.

The problem isnt running module code. What happens in this case

mod->next = module_list;
module_list = mod;  /* link it in */

Note no write barrier.

Delete is even worse

We unlink the module
We free the memory

At the same time another cpu may be walking the exception table that we free.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Thu, 15 Feb 2001, Alan Cox wrote:
>   Question of the day for the VM folks:
>   If CPU1 is loading the exception tables for a module and
>   CPU2 faults.. what happens 8)

"loading" == in sys_create_module ?

The module list is modified atomically, so either we search the new table
or we don't, but we never see intermediate states.  Not searching the new
table shouldn't be a problem as we shouldn't run module code until
sys_init_module time.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Thu, 15 Feb 2001, Alan Cox wrote:
   Question of the day for the VM folks:
   If CPU1 is loading the exception tables for a module and
   CPU2 faults.. what happens 8)

"loading" == in sys_create_module ?

The module list is modified atomically, so either we search the new table
or we don't, but we never see intermediate states.  Not searching the new
table shouldn't be a problem as we shouldn't run module code until
sys_init_module time.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Alan Cox

 The module list is modified atomically, so either we search the new table
 or we don't, but we never see intermediate states.  Not searching the new
 table shouldn't be a problem as we shouldn't run module code until
 sys_init_module time.

The problem isnt running module code. What happens in this case

mod-next = module_list;
module_list = mod;  /* link it in */

Note no write barrier.

Delete is even worse

We unlink the module
We free the memory

At the same time another cpu may be walking the exception table that we free.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Mon, 19 Feb 2001, Alan Cox wrote:
 mod-next = module_list;

put_mod_name() here.

 module_list = mod;  /* link it in */
 
 Note no write barrier.

put_mod_name calls free_page which always implies a memory barrier.  This
isn't beautiful but it won't blow up either.

Actually that isn't relevant since the actual table is copied _after_
ex_table_{start,end}.  We'll scan uninitialized memory and if some word
happens to match the fault EIP we jump to the bogus fixup.

 Delete is even worse
 
 We unlink the module
 We free the memory
 
 At the same time another cpu may be walking the exception table that we free.

True.

Rusty had a patch that locked the module list properly IIRC.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Keith Owens

On Mon, 19 Feb 2001 11:35:08 + (GMT), 
Alan Cox [EMAIL PROTECTED] wrote:
The problem isnt running module code. What happens in this case

mod-next = module_list;
module_list = mod;  /* link it in */

Note no write barrier.

humourIt works on ix86 so the code must be right/humour

Delete is even worse

We unlink the module
We free the memory

At the same time another cpu may be walking the exception table that we free.

Another good reason why locking modules via use counts from external
code is not the right fix.  We definitely need a quiesce model for
module removal.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Mon, 19 Feb 2001, Keith Owens wrote:
 On Mon, 19 Feb 2001 11:35:08 + (GMT), 
 Alan Cox [EMAIL PROTECTED] wrote:
 The problem isnt running module code. What happens in this case
 
 mod-next = module_list;
 module_list = mod;  /* link it in */
 
 Note no write barrier.
 
 humourIt works on ix86 so the code must be right/humour

Too bad it doesn't.

 Delete is even worse
 
 We unlink the module
 We free the memory
 
 At the same time another cpu may be walking the exception table that we free.
 
 Another good reason why locking modules via use counts from external
 code is not the right fix.  We definitely need a quiesce model for
 module removal.

Unless I'm mistaken, we need both use counts and SMP magic (though not
necessarily as extreme as what the "freeze all other CPUs during module
unload" patch did).


I think something like this would work (in addition to use counts)

int callin_func(void *p)
{
int *cpu = p;

while (*cpu != smp_processor_id()) {
current-cpus_allowed = 1  *cpu;
schedule();
}

return 0;
}

void callin_other_cpus(void)
{
int cpus[smp_num_cpus];
int i;

for (i=0; ismp_num_cpus; i++) {
cpus[i] = i;

kernel_thread(callin_func, cpus[i], ...);
}
}

and call callin_other_cpus() before unloading a module.


I'm not sure how you could make exception handling safe without locking
all accesses to the module list - but that sounds like the sane thing to
do anyway.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Keith Owens

On Mon, 19 Feb 2001 06:15:22 -0600 (CST), 
Philipp Rumpf [EMAIL PROTECTED] wrote:
Unless I'm mistaken, we need both use counts and SMP magic (though not
necessarily as extreme as what the "freeze all other CPUs during module
unload" patch did).

I think something like this would work (in addition to use counts)

int callin_func(void *p)
{
   int *cpu = p;

   while (*cpu != smp_processor_id()) {
   current-cpus_allowed = 1  *cpu;
   schedule();
   }

   return 0;
}

void callin_other_cpus(void)
{
   int cpus[smp_num_cpus];
   int i;

   for (i=0; ismp_num_cpus; i++) {
   cpus[i] = i;

   kernel_thread(callin_func, cpus[i], ...);
   }
}

and call callin_other_cpus() before unloading a module.

I'm not sure how you could make exception handling safe without locking
all accesses to the module list - but that sounds like the sane thing to
do anyway.

No need for a callin routine, you can get this for free as part of
normal scheduling.  The sequence goes :-

if (use_count == 0) {
  module_unregister();
  wait_for_at_least_one_schedule_on_every_cpu();
  if (use_count != 0) {
module_register();  /* lost the unregister race */
  }
  else {
/* nobody can enter the module now */
module_release_resources();
unlink_module_from_list();
wait_for_at_least_one_schedule_on_every_cpu();
free_module_storage();
  }
}

wait_for_at_least_one_schedule_on_every_cpu() prevents the next
operation until at least one schedule has been executed on every cpu.
Whether this is done as a call back or a separate kernel thread that
schedules itself on every cpu or the current process scheduling itself
on every cpu is an implementation detail.  All that matters is that any
other cpu that might have been accessing the module has gone through
schedule and therefore is no longer accessing the module's data or
code.

The beauty of this approach is that the rest of the cpus can do normal
work.  No need to bring everything to a dead stop.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Christoph Hellwig

In article [EMAIL PROTECTED] you wrote:
 No need for a callin routine, you can get this for free as part of
 normal scheduling.  The sequence goes :-

 if (use_count == 0) {
   module_unregister();
   wait_for_at_least_one_schedule_on_every_cpu();
   if (use_count != 0) {
 module_register();/* lost the unregister race */
   }
   else {
 /* nobody can enter the module now */
 module_release_resources();
 unlink_module_from_list();
 wait_for_at_least_one_schedule_on_every_cpu();
 free_module_storage();
   }
 }

 wait_for_at_least_one_schedule_on_every_cpu() prevents the next
 operation until at least one schedule has been executed on every cpu.
 Whether this is done as a call back or a separate kernel thread that
 schedules itself on every cpu or the current process scheduling itself
 on every cpu is an implementation detail.  All that matters is that any
 other cpu that might have been accessing the module has gone through
 schedule and therefore is no longer accessing the module's data or
 code.

You just reinvented the read-copy-update model
(http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html)...

The mail proposing that locking model for module unloading is not yet
in the arvhices, sorry.

Christoph

P.S. Weren't you Cc:'ed on that mail?
-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Tue, 20 Feb 2001, Keith Owens wrote:
 On Mon, 19 Feb 2001 06:15:22 -0600 (CST), 
 Philipp Rumpf [EMAIL PROTECTED] wrote:
 No need for a callin routine, you can get this for free as part of
 normal scheduling.  The sequence goes :-
 
 if (use_count == 0) {
   module_unregister();
   wait_for_at_least_one_schedule_on_every_cpu();
   if (use_count != 0) {
 module_register();/* lost the unregister race */
   }
   else {
 /* nobody can enter the module now */
 module_release_resources();
 unlink_module_from_list();
 wait_for_at_least_one_schedule_on_every_cpu();
 free_module_storage();
   }
 }
 
 wait_for_at_least_one_schedule_on_every_cpu() prevents the next

wait_for_at_least_one_schedule_on_every_cpu() *is* callin_other_cpus().
I agree the name isn't optimal.  (and yes, you could implement it by
hacking sched.c directly, but I don't think that's necessary).

 The beauty of this approach is that the rest of the cpus can do normal
 work.  No need to bring everything to a dead stop.

Which nicely avoids potential deadlocks in modules that need to initialize
on all CPUs.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Manfred Spraul

Keith Owens wrote:
  wait_for_at_least_one_schedule_on_every_cpu();

what about

spin_unlock_wait(global_exception_lock);

The actual exception table waker uses
spin_lock_irqsave(global_exception_lock);

spin_unlock_irqsave(global_exception_lock);

Or a simple spinlock - the code shouldn't be performance critical.
--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Alan Cox

 Rusty had a patch that locked the module list properly IIRC.

So does -ac now. I just take a spinlock for the modify cases that race
against faults (including vmalloc faults from irq context)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.1-ac15

2001-02-19 Thread Alan Cox

 At the same time another cpu may be walking the exception table that we free.
 
 Another good reason why locking modules via use counts from external
 code is not the right fix.  We definitely need a quiesce model for
 module removal.

My spinlock based fix has almost no contention and doesn't require 64
processors to grind to a halt on a big machine just to handle a module
list change. Sorry, I don't think it supports your argument.




Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Mon, 19 Feb 2001, Alan Cox wrote:
  Rusty had a patch that locked the module list properly IIRC.
 
 So does -ac now. I just take a spinlock for the modify cases that race
 against faults (including vmalloc faults from irq context)

so you hold a spinlock during copy_from_user?  Or did you change
sys_init_module/sys_create_module semantics?




Re: Linux 2.4.1-ac15

2001-02-19 Thread Alan Cox

 so you hold a spinlock during copy_from_user?  Or did you change
 sys_init_module/sys_create_module semantics?

The only points where I need to hold a spinlock are editing the chain and
walking it in a case where I don't hold the kernel lock.

So I spin_lock_irqsave the two ops updating the list links in each case and
the exception lookup walk.
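
Alan's scheme can be sketched in 2.4-era kernel style. This is a
hypothetical illustration, not the actual -ac patch: the lock name
(modlist_lock) and the struct module field names are assumptions.

/* Hypothetical sketch (not the actual -ac patch): one irq-safe
 * spinlock guards both the module-list edits and the exception
 * table walk, which can run from irq context (vmalloc faults). */
static spinlock_t modlist_lock = SPIN_LOCK_UNLOCKED;

void insert_module(struct module *mod)
{
	unsigned long flags;

	spin_lock_irqsave(&modlist_lock, flags);
	mod->next = module_list;	/* edit the chain under the lock */
	module_list = mod;
	spin_unlock_irqrestore(&modlist_lock, flags);
}

unsigned long search_module_extables(unsigned long addr)
{
	struct module *mp;
	unsigned long fixup = 0, flags;

	/* the walk takes the same lock: it may race with list edits */
	spin_lock_irqsave(&modlist_lock, flags);
	for (mp = module_list; mp != NULL; mp = mp->next) {
		if (!mp->ex_table_start)
			continue;
		fixup = search_one_table(mp->ex_table_start,
					 mp->ex_table_end - 1, addr);
		if (fixup)
			break;
	}
	spin_unlock_irqrestore(&modlist_lock, flags);
	return fixup;
}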




Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Mon, 19 Feb 2001, Alan Cox wrote:
  so you hold a spinlock during copy_from_user?  Or did you change
  sys_init_module/sys_create_module semantics?
 
 The only points where I need to hold a spinlock are editing the chain and
 walking it in a case where I don't hold the kernel lock.
 
 So I spin_lock_irqsave the two ops updating the list links in each case and
 the exception lookup walk.

So you fixed the nonexistent race only.  The real race is that the module
structure gets initialized first (so the exception table pointers point to
uninitialized vmalloc'd (module_map'd) memory), then the module data
(including the exception table) gets copied.

The race window is from the first copy_from_user in sys_init_module until
the second one.

Or am I missing something ?







Re: Linux 2.4.1-ac15

2001-02-19 Thread Alan Cox

 So you fixed the nonexistent race only.  The real race is that the module

Umm, I fixed the small race. You are right that there is a second race.

 uninitialized vmalloc'd (module_map'd) memory), then the module data
 (including the exception table) gets copied.
 The race window is from the first copy_from_user in sys_init_module until
 the second one.

Yep. Obvious answer. Ignore exception tables for modules that are not
MOD_RUNNING.




Re: Linux 2.4.1-ac15

2001-02-19 Thread Philipp Rumpf

On Mon, 19 Feb 2001, Alan Cox wrote:
  So you fixed the nonexistent race only.  The real race is that the module
 
 Umm I fixed the small race. You are right that there is a second race.

There's just one race.  The small race is nonexistent since
put_mod_name always acts as a memory barrier.

  uninitialized vmalloc'd (module_map'd) memory), then the module data
  (including the exception table) gets copied.
  The race window is from the first copy_from_user in sys_init_module until
  the second one.
 
 Yep. Obvious answer. Ignore exception tables for modules that are not
 MOD_RUNNING.

You can have exceptions while initializing.  Ignoring modules that are
not MOD_RUNNING|MOD_INITIALIZING should work though.
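
In 2.4 terms the check could look like the following fragment inside the
module exception-table walk (a sketch; the surrounding loop and the
field names are assumptions):

/* Sketch: skip modules whose exception table may still point into
 * uninitialized vmalloc'd memory, i.e. not yet fully copied in. */
if (!(mod->flags & (MOD_RUNNING | MOD_INITIALIZING)))
	continue;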




Re: Linux 2.4.1-ac15

2001-02-19 Thread Alan Cox

 You can have exceptions while initializing.  not
 MOD_RUNNING|MOD_INITIALIZING should work though.

True.




Re: Linux 2.4.1-ac15

2001-02-19 Thread Keith Owens

On Mon, 19 Feb 2001 14:25:39 +0100, 
Christoph Hellwig [EMAIL PROTECTED] wrote:
You just reinvented the read-copy-update model
(http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html)...

The mail proposing that locking model for module unloading is not yet
in the archives, sorry.

I did not reinvent it, I was following up on the discussion we already
had off list about using this model.




Re: Linux 2.4.1-ac15

2001-02-19 Thread Keith Owens

On Mon, 19 Feb 2001 16:04:07 + (GMT), 
Alan Cox [EMAIL PROTECTED] wrote:
My spinlock based fix has almost no contention and doesn't require 64 processors
to grind to a halt on a big machine just to handle a module list change. Sorry,
I don't think it supports your argument

I am not proposing that the other cpus grind to a halt.  On the
contrary, they are allowed to continue running.  It is the current
process that is suspended until all other cpus have gone through at
least one schedule.

On Mon, 19 Feb 2001 16:23:09 +0100, 
Manfred Spraul [EMAIL PROTECTED] wrote:
what about

   spin_unlock_wait(&global_exception_lock);

The actual exception table walker uses
   spin_lock_irqsave(&global_exception_lock, flags);

   spin_unlock_irqrestore(&global_exception_lock, flags);

Or a simple spinlock - the code shouldn't be performance critical.

All lock based fixes have the disadvantage that you penalise the entire
kernel all the time.  No matter how small the overhead of getting the
lock, it still exists - so we are slowing down the main kernel all the
time to handle the rare case of module unloading.

Also notice that we keep adding spinlocks.  One for the main module
list, another for the exception tables.  Then there is the architecture
specific module data, including unwind information for IA64; that also
needs to be locked.  Requiring more than one spinlock to handle module
unloading tells me that the design is wrong.

Using wait_for_at_least_one_schedule_on_every_cpu() has no penalty
except on the process running rmmod.  It does not require yet more
spinlocks and is architecture independent.  Since schedule() updates
sched_data->last_schedule, all the rmmod process has to do is wait
until the last_schedule value changes on all cpus, forcing a reschedule
if necessary.

Zero overhead in schedule, zero overhead in exception handling, zero
overhead in IA64 unwind code, architecture independent.  Sounds good to
me.
