Re: Linux 2.4.1-ac15
This new one should be better. Using the spinlock for the SMP serialization is
ok because that's an extremely slow path.

diff -urN -X /home/andrea/bin/dontdiff 2.4.2/arch/i386/mm/fault.c 2.4.2aa/arch/i386/mm/fault.c
--- 2.4.2/arch/i386/mm/fault.c	Thu Feb 22 03:44:53 2001
+++ 2.4.2aa/arch/i386/mm/fault.c	Sat Feb 24 05:41:11 2001
@@ -326,23 +326,27 @@
 		int offset = __pgd_offset(address);
 		pgd_t *pgd, *pgd_k;
 		pmd_t *pmd, *pmd_k;
+		static spinlock_t lazy_vmalloc_lock = SPIN_LOCK_UNLOCKED;
+		unsigned long flags;
 
 		pgd = tsk->active_mm->pgd + offset;
 		pgd_k = init_mm.pgd + offset;
 
-		if (!pgd_present(*pgd)) {
-			if (!pgd_present(*pgd_k))
-				goto bad_area_nosemaphore;
+		spin_lock_irqsave(&lazy_vmalloc_lock, flags);
+		if (!pgd_present(*pgd) && pgd_present(*pgd_k)) {
 			set_pgd(pgd, *pgd_k);
+			spin_unlock_irqrestore(&lazy_vmalloc_lock, flags);
 			return;
 		}
-
 		pmd = pmd_offset(pgd, address);
 		pmd_k = pmd_offset(pgd_k, address);
-
-		if (pmd_present(*pmd) || !pmd_present(*pmd_k))
-			goto bad_area_nosemaphore;
-		set_pmd(pmd, *pmd_k);
-		return;
+		if (!pmd_present(*pmd) && pmd_present(*pmd_k)) {
+			set_pmd(pmd, *pmd_k);
+			spin_unlock_irqrestore(&lazy_vmalloc_lock, flags);
+			return;
+		}
+		spin_unlock_irqrestore(&lazy_vmalloc_lock, flags);
+		goto bad_area_nosemaphore;
 	}
 }
diff -urN -X /home/andrea/bin/dontdiff 2.4.2/arch/i386/mm/init.c 2.4.2aa/arch/i386/mm/init.c
--- 2.4.2/arch/i386/mm/init.c	Sat Feb 10 02:34:03 2001
+++ 2.4.2aa/arch/i386/mm/init.c	Fri Feb 23 19:09:25 2001
@@ -116,21 +116,13 @@
 	pte_t *pte;
 
 	pte = (pte_t *) __get_free_page(GFP_KERNEL);
-	if (pmd_none(*pmd)) {
-		if (pte) {
-			clear_page(pte);
-			set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte)));
-			return pte + offset;
-		}
-		set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(get_bad_pte_table())));
-		return NULL;
+	if (pte) {
+		clear_page(pte);
+		set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte)));
+		return pte + offset;
 	}
-	free_page((unsigned long)pte);
-	if (pmd_bad(*pmd)) {
-		__handle_bad_pmd_kernel(pmd);
-		return NULL;
-	}
-	return (pte_t *) pmd_page(*pmd) + offset;
+	set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(get_bad_pte_table())));
+	return NULL;
 }
 
 pte_t *get_pte_slow(pmd_t *pmd, unsigned long offset)
diff -urN -X /home/andrea/bin/dontdiff 2.4.2/include/asm-i386/pgalloc-3level.h 2.4.2aa/include/asm-i386/pgalloc-3level.h
--- 2.4.2/include/asm-i386/pgalloc-3level.h	Fri Dec  3 20:12:23 1999
+++ 2.4.2aa/include/asm-i386/pgalloc-3level.h	Fri Feb 23 19:03:14 2001
@@ -53,12 +53,9 @@
 	if (!page)
 		page = get_pmd_slow();
 	if (page) {
-		if (pgd_none(*pgd)) {
-			set_pgd(pgd, __pgd(1 + __pa(page)));
-			__flush_tlb();
-			return page + address;
-		} else
-			free_pmd_fast(page);
+		set_pgd(pgd, __pgd(1 + __pa(page)));
+		__flush_tlb();
+		return page + address;
 	} else
 		return NULL;
 }

non-PAE mode is still untested but it should now have a chance to work.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
On Fri, Feb 23, 2001 at 01:09:02PM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2001, Andrea Arcangeli wrote:
> >
> > I think that can't happen. In fact I think the whole section:
> >
> >	pmd = pmd_offset(pgd, address);
> >	pmd_k = pmd_offset(pgd_k, address);
> >
> >	if (pmd_present(*pmd) || !pmd_present(*pmd_k))
> >		goto bad_area_nosemaphore;
> >	set_pmd(pmd, *pmd_k);
> >	return;
> >
> > is superfluous.
>
> No. Think about the differences in PAE and non-PAE mode.
>
> > The middle pagetable is shared and the pmd_t entry is set by vmalloc
> > itself (it just points to the as well shared pte), it's only the pgd
> > that is set up lazily
>
> In non-PAE mode, the pgd entry doesn't exist. pgd_present() returns an
> unconditional 1. It's the pmd functions that kick in then.

Whoops, I see, I was thinking only of PAE mode indeed 8), sorry. Thanks for
the correction. All the rest still applies (and the patch still looks fine
for PAE mode). I think I only need to rediff my patch after resurrecting the
pmd thing inside the cli critical section in a #ifndef CONFIG_X86_PAE region.

Andrea
Re: Linux 2.4.1-ac15
On Fri, 23 Feb 2001, Andrea Arcangeli wrote:
>
> I think that can't happen. In fact I think the whole section:
>
>	pmd = pmd_offset(pgd, address);
>	pmd_k = pmd_offset(pgd_k, address);
>
>	if (pmd_present(*pmd) || !pmd_present(*pmd_k))
>		goto bad_area_nosemaphore;
>	set_pmd(pmd, *pmd_k);
>	return;
>
> is superfluous.

No. Think about the differences in PAE and non-PAE mode.

> The middle pagetable is shared and the pmd_t entry is set by vmalloc
> itself (it just points to the as well shared pte), it's only the pgd
> that is set up lazily

In non-PAE mode, the pgd entry doesn't exist. pgd_present() returns an
unconditional 1. It's the pmd functions that kick in then.

		Linus
Re: Linux 2.4.1-ac15
On Thu, Feb 22, 2001 at 10:29:58AM +, Alan Cox wrote:
> > >We can take page faults in interrupt handlers in 2.4 so I had to use a
> > >spinlock, but that sounds the same
> >
> > Umm? The above doesn't really make sense.
> >
> > We can take a page fault on the kernel region with the lazy page
> > directory filling, but that code will just set the PGD entry and exit
> > without taking any lock at all. So it basically ends up being an
> > "invisible" event.
>
> Its only normally invisible. Mark Hemment pointed out there is currently a
> race where if both cpus go to fill in the same entry the logic goes
>
>	CPU1			CPU2
>
>	pgd present		pgd present
>	pmd not present
>	load pmd
>				pmd present
>				Explode messily

I think that can't happen. In fact I think the whole section:

	pmd = pmd_offset(pgd, address);
	pmd_k = pmd_offset(pgd_k, address);

	if (pmd_present(*pmd) || !pmd_present(*pmd_k))
		goto bad_area_nosemaphore;
	set_pmd(pmd, *pmd_k);
	return;

is superfluous. The middle pagetable is shared and the pmd_t entry is set by
vmalloc itself (it just points to the as well shared pte); it's only the pgd
that is set up lazily, to simplify the locking (locking is simplified by an
order of magnitude because, for example, in set_64bit we don't need the lock
on the bus: we just need to do the 64bit move in a single instruction, so we
can't be interrupted in the middle by an irq, and all sorts of similar issues
now become serialized inside the local CPU). So in short, unless I'm
misreading something, I think all the vmalloc faults will be handled by the:

	if (!pgd_present(*pgd)) {
		if (!pgd_present(*pgd_k))
			goto bad_area_nosemaphore;
		set_pgd(pgd, *pgd_k);
		return;
	}

That said, there are a couple of other issues with the vmalloc pgd lazy
handling, but nothing specific to SMP or threads.
1) we can triple fault (page_fault -> irq -> vmalloc fault -> irq -> vmalloc
   fault) and I'm not sure if the cpu will reset after that, because the three
   faults are interleaved with two irq exceptions; it's not so easy to find
   the answer empirically, and the documentation apparently doesn't specify
   that. If it resets (it shouldn't, if the cpu was well designed) then the
   whole vmalloc lazy design will be hardly fixable.

2) we could race with irqs locally to the CPU this way during the vmalloc
   handler:

	irq handler
		vmalloc_fault
			nested irq_handler
				vmalloc fault
					pgd is not present
					set_pgd
				iret
			pgd is present
			enter the pmd business and find the pmd is just set
			goto error

   and I suspect this is actually the bug triggered by Mark.

3) offtopic with the vmalloc thing, but I also noticed that in some places
   during the pgd/pmd/pte allocations we re-check that nobody mapped in the
   pagetable from under us. But in 2.4 we don't hold the big lock, so
   sleeping or not sleeping during allocations is meaningless w.r.t.
   serialization: we rely only on the semaphore and on the fact that other
   tasks can't play with our pagetables (one of the reasons dropping
   set_pgdir simplifies the locking).
So I did this patch, that should be the correct cure for the vmalloc pgd irq
race (2), and that drops a few double checks that seem unnecessary (3), and
hopefully (1) is not an issue and the cpu is smart enough to understand it
must not hard reset:

diff -urN -X /home/andrea/bin/dontdiff 2.4.2/arch/i386/mm/fault.c 2.4.2aa/arch/i386/mm/fault.c
--- 2.4.2/arch/i386/mm/fault.c	Thu Feb 22 03:44:53 2001
+++ 2.4.2aa/arch/i386/mm/fault.c	Fri Feb 23 21:15:53 2001
@@ -326,23 +326,19 @@
 		int offset = __pgd_offset(address);
 		pgd_t *pgd, *pgd_k;
 		pmd_t *pmd, *pmd_k;
+		unsigned long flags;
 
 		pgd = tsk->active_mm->pgd + offset;
 		pgd_k = init_mm.pgd + offset;
 
-		if (!pgd_present(*pgd)) {
-			if (!pgd_present(*pgd_k))
-				goto bad_area_nosemaphore;
+		__save_flags(flags);
+		__cli();
+		if (!pgd_present(*pgd) && pgd_present(*pgd_k)) {
 			set_pgd(pgd, *pgd_k);
+			__restore_flags(flags);
 			return;
 		}
-
-		pmd = pmd_offset(pgd, address);
-		pmd_k = pmd_offset(pgd_k, address);
-
-		if (pmd_present(*pmd) || !pmd_present(*pmd_k))
-			goto bad_area_nosemaphore;
-		set_pmd(pmd, *pmd_k);
-		return;
+		__restore_flags(flags);
+		goto bad_area_nosemaphore;
Re: Linux 2.4.1-ac15
> >We can take page faults in interrupt handlers in 2.4 so I had to use a
> >spinlock, but that sounds the same
>
> Umm? The above doesn't really make sense.
>
> We can take a page fault on the kernel region with the lazy page
> directory filling, but that code will just set the PGD entry and exit
> without taking any lock at all. So it basically ends up being an
> "invisible" event.

It's only normally invisible. Mark Hemment pointed out there is currently a
race where if both cpus go to fill in the same entry the logic goes

	CPU1			CPU2

	pgd present		pgd present
	pmd not present
	load pmd
				pmd present
				Explode messily

The race looks right to me since both CPUs can be running from the same mm.
The obvious fix (removing the 2nd check) of course hangs the WP check. I have
a hack [not for Linus grade] for that now, but really need to walk as far as
the pte in the racy case to check for a WP fault.

> 2.4.x. In that case you would take the exception table lock, but that is
> true in both 2.2.x and in 2.4.x.

I didn't say it wasn't

Alan
Re: Linux 2.4.1-ac15
> > We can take page faults in interrupt handlers in 2.4 so I had to use a
> > spinlock, but that sounds the same
>
> We can? Woah, please explain.

vmalloc does a lazy load of the tlb. That can lead to the exception table
being walked on an IRQ
Re: Linux 2.4.1-ac15
In message <[EMAIL PROTECTED]> you write:
> > This is a while back, but I thought the solution Philipp and I came up
> > with was to simply use a rw semaphore for this, which was taken (read
> > only) on page fault if we have to scan the exception table.
>
> We can take page faults in interrupt handlers in 2.4 so I had to use a
> spinlock, but that sounds the same

We can? Woah, please explain.

Rusty.
--
Premature optmztion is rt of all evl. --DK
Re: Linux 2.4.1-ac15
In article <[EMAIL PROTECTED]>,
Alan Cox <[EMAIL PROTECTED]> wrote:
>> This is a while back, but I thought the solution Philipp and I came up
>> with was to simply use a rw semaphore for this, which was taken (read
>> only) on page fault if we have to scan the exception table.
>
>We can take page faults in interrupt handlers in 2.4 so I had to use a
>spinlock, but that sounds the same

Umm? The above doesn't really make sense.

We can take a page fault on the kernel region with the lazy page
directory filling, but that code will just set the PGD entry and exit
without taking any lock at all. So it basically ends up being an
"invisible" event.

Now, if an interrupt handler accesses kernel memory that just isn't there,
that has _always_ taken a page fault, and that case is not new to 2.4.x.
In that case you would take the exception table lock, but that is true in
both 2.2.x and in 2.4.x.

		Linus
Re: Linux 2.4.1-ac15
> This is a while back, but I thought the solution Philipp and I came up
> with was to simply use a rw semaphore for this, which was taken (read
> only) on page fault if we have to scan the exception table.

We can take page faults in interrupt handlers in 2.4 so I had to use a
spinlock, but that sounds the same
Re: Linux 2.4.1-ac15
In message <[EMAIL PROTECTED]> you write:
> > We unlink the module
> > We free the memory
> >
> > At the same time another cpu may be walking the exception table that we
> > free.
>
> True.
>
> Rusty had a patch that locked the module list properly IIRC.

This is a while back, but I thought the solution Philipp and I came up
with was to simply use a rw semaphore for this, which was taken (read
only) on page fault if we have to scan the exception table.

Rusty.
--
Premature optmztion is rt of all evl. --DK
Re: Linux 2.4.1-ac15
On Tue, 20 Feb 2001, Keith Owens wrote:
> On Mon, 19 Feb 2001 16:04:07 +0000 (GMT),
> Alan Cox <[EMAIL PROTECTED]> wrote:
>
> Using wait_for_at_least_one_schedule_on_every_cpu() has no penalty
> except on the process running rmmod. It does not require yet more
> spinlocks and is architecture independent. Since schedule() updates
> sched_data->last_schedule, all the rmmod process has to do is wait
> until the last_schedule value changes on all cpus, forcing a reschedule
> if necessary.
>
> Zero overhead in schedule, zero overhead in exception handling, zero
> overhead in IA64 unwind code, architecture independent. Sounds good to
> me.

Not architecture independent unfortunately. get_cycles() always returns 0
on some SMP-capable machines. ->last_schedule doesn't change if one CPU is
always idle (kernel threads do), or always running the same process (kernel
threads do, unless it's an RT process in an endless loop). I'm not sure how
you'd go about "forcing a reschedule if necessary".

Using temporary kernel threads has zero overhead in {schedule, exception
handling, IA64 unwind code} and actually works on all architectures. It adds
overhead to the wait_for_at_least_one_schedule_on_every_cpu() code, but I
think that's acceptable.
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001 16:04:07 +0000 (GMT), Alan Cox <[EMAIL PROTECTED]> wrote:
> My spinlock based fix has almost no contention and doesn't require 64
> processors grind to a halt on a big machine just to handle a module list
> change. Sorry I don't think it supports your argument

I am not proposing that the other cpus grind to a halt. On the contrary,
they are allowed to continue running. It is the current process that is
suspended until all other cpus have gone through at least one schedule.

On Mon, 19 Feb 2001 16:23:09 +0100, Manfred Spraul <[EMAIL PROTECTED]> wrote:
> what about
>
>	spin_unlock_wait(_exception_lock);
>
> The actual exception table waker uses
>	spin_lock_irqsave(_exception_lock);
>	spin_unlock_irqsave(_exception_lock);
>
> Or a simple spinlock - the code shouldn't be performance critical.

All lock based fixes have the disadvantage that you penalise the entire
kernel all the time. No matter how small the overhead of getting the lock,
it still exists - so we are slowing down the main kernel all the time to
handle the rare case of module unloading.

Also notice that we keep adding spinlocks. One for the main module list,
another for the exception tables. Then there is the architecture specific
module data, including unwind information for IA64; that also needs to be
locked. Requiring more than one spinlock to handle module unloading tells
me that the design is wrong.

Using wait_for_at_least_one_schedule_on_every_cpu() has no penalty except
on the process running rmmod. It does not require yet more spinlocks and is
architecture independent. Since schedule() updates
sched_data->last_schedule, all the rmmod process has to do is wait until
the last_schedule value changes on all cpus, forcing a reschedule if
necessary.

Zero overhead in schedule, zero overhead in exception handling, zero
overhead in IA64 unwind code, architecture independent. Sounds good to me.
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001 14:25:39 +0100, Christoph Hellwig <[EMAIL PROTECTED]> wrote: >You just reinvented the read-copy-update model >(http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html)... > >The mail proposing that locking model for module unloading is not yet >in the archives, sorry. I did not reinvent it, I was following up on the discussion we already had off list about using this model.
Re: Linux 2.4.1-ac15
> You can have exceptions while initializing. not > MOD_RUNNING|MOD_INITIALIZING should work though. True.
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001, Alan Cox wrote: > > So you fixed the nonexistent race only. The real race is that the module > > Umm I fixed the small race. You are right that there is a second race. There's just one race. The small race is nonexistent since put_mod_name always acts as a memory barrier. > > uninitialized vmalloc'd (module_map'd) memory), then the module data > > (including the exception table) gets copied. > > The race window is from the first copy_from_user in sys_init_module until > > the second one. > > Yep. Obvious answer. Ignore exception tables for modules that are not > MOD_RUNNING. You can have exceptions while initializing. not MOD_RUNNING|MOD_INITIALIZING should work though.
Re: Linux 2.4.1-ac15
> So you fixed the nonexistent race only. The real race is that the module Umm I fixed the small race. You are right that there is a second race. > uninitialized vmalloc'd (module_map'd) memory), then the module data > (including the exception table) gets copied. > The race window is from the first copy_from_user in sys_init_module until > the second one. Yep. Obvious answer. Ignore exception tables for modules that are not MOD_RUNNING.
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001, Alan Cox wrote: > > so you hold a spinlock during copy_from_user ? Or did you change > > sys_init_module/sys_create_modules semantics ? > > The only points I need to hold a spinlock are editing the chain and > walking it in a case where I dont hold the kernel lock. > > So I spin_lock_irqsave the two ops updating the list links in each case and > the exception lookup walk So you fixed the nonexistent race only. The real race is that the module structure gets initialized first (so the exception table pointers point to uninitialized vmalloc'd (module_map'd) memory), then the module data (including the exception table) gets copied. The race window is from the first copy_from_user in sys_init_module until the second one. Or am I missing something ?
Re: Linux 2.4.1-ac15
> so you hold a spinlock during copy_from_user ? Or did you change > sys_init_module/sys_create_modules semantics ? The only points I need to hold a spinlock are editing the chain and walking it in a case where I dont hold the kernel lock. So I spin_lock_irqsave the two ops updating the list links in each case and the exception lookup walk
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001, Alan Cox wrote: > > Rusty had a patch that locked the module list properly IIRC. > > So does -ac now. I just take a spinlock for the modify cases that race > against faults (including vmalloc faults from irq context) so you hold a spinlock during copy_from_user ? Or did you change sys_init_module/sys_create_modules semantics ?
Re: Linux 2.4.1-ac15
> >At the same time another cpu may be walking the exception table that we free. > > Another good reason why locking modules via use counts from external > code is not the right fix. We definitely need a quiesce model for > module removal. My spinlock based fix has almost no contention and doesnt require 64 processors grind to a halt on a big machine just to handle a module list change. Sorry I don't think it supports your argument
Re: Linux 2.4.1-ac15
> Rusty had a patch that locked the module list properly IIRC. So does -ac now. I just take a spinlock for the modify cases that race against faults (including vmalloc faults from irq context)
Re: Linux 2.4.1-ac15
Keith Owens wrote: > wait_for_at_least_one_schedule_on_every_cpu(); what about spin_unlock_wait(&global_exception_lock); The actual exception table walker uses spin_lock_irqsave(&global_exception_lock); spin_unlock_irqrestore(&global_exception_lock); Or a simple spinlock - the code shouldn't be performance critical. -- Manfred
Re: Linux 2.4.1-ac15
On Tue, 20 Feb 2001, Keith Owens wrote: > On Mon, 19 Feb 2001 06:15:22 -0600 (CST), > Philipp Rumpf <[EMAIL PROTECTED]> wrote: > No need for a callin routine, you can get this for free as part of > normal scheduling. The sequence goes :- > > if (use_count == 0) { > module_unregister(); > wait_for_at_least_one_schedule_on_every_cpu(); > if (use_count != 0) { > module_register(); /* lost the unregister race */ > } > else { > /* nobody can enter the module now */ > module_release_resources(); > unlink_module_from_list(); > wait_for_at_least_one_schedule_on_every_cpu(); > free_module_storage(); > } > } > > wait_for_at_least_one_schedule_on_every_cpu() prevents the next wait_for_at_least_one_schedule_on_every_cpu() *is* callin_other_cpus(). I agree the name isn't optimal. (and yes, you could implement it by hacking sched.c directly, but I don't think that's necessary). > The beauty of this approach is that the rest of the cpus can do normal > work. No need to bring everything to a dead stop. Which nicely avoids potential deadlocks in modules that need to initialize on all CPUs.
Re: Linux 2.4.1-ac15
In article <[EMAIL PROTECTED]> you wrote: > No need for a callin routine, you can get this for free as part of > normal scheduling. The sequence goes :- > > if (use_count == 0) { > module_unregister(); > wait_for_at_least_one_schedule_on_every_cpu(); > if (use_count != 0) { > module_register(); /* lost the unregister race */ > } > else { > /* nobody can enter the module now */ > module_release_resources(); > unlink_module_from_list(); > wait_for_at_least_one_schedule_on_every_cpu(); > free_module_storage(); > } > } > > wait_for_at_least_one_schedule_on_every_cpu() prevents the next > operation until at least one schedule has been executed on every cpu. > Whether this is done as a call back or a separate kernel thread that > schedules itself on every cpu or the current process scheduling itself > on every cpu is an implementation detail. All that matters is that any > other cpu that might have been accessing the module has gone through > schedule and therefore is no longer accessing the module's data or > code. You just reinvented the read-copy-update model (http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html)... The mail proposing that locking model for module unloading is not yet in the archives, sorry. Christoph P.S. Weren't you Cc:'ed on that mail? -- Of course it doesn't work. We've performed a software upgrade. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001 06:15:22 -0600 (CST), Philipp Rumpf <[EMAIL PROTECTED]> wrote: >Unless I'm mistaken, we need both use counts and SMP magic (though not >necessarily as extreme as what the "freeze all other CPUs during module >unload" patch did). > >I think something like this would work (in addition to use counts) > >int callin_func(void *p) >{ > int *cpu = p; > > while (*cpu != smp_processor_id()) { > current->cpus_allowed = 1 << *cpu; > schedule(); > } > > return 0; >} > >void callin_other_cpus(void) >{ > int cpus[smp_num_cpus]; > int i; > > for (i=0; i < smp_num_cpus; i++) { > cpus[i] = i; > kernel_thread(callin_func, &cpus[i], ...); > } >} > >and call callin_other_cpus() before unloading a module. > >I'm not sure how you could make exception handling safe without locking >all accesses to the module list - but that sounds like the sane thing to >do anyway. No need for a callin routine, you can get this for free as part of normal scheduling. The sequence goes :- if (use_count == 0) { module_unregister(); wait_for_at_least_one_schedule_on_every_cpu(); if (use_count != 0) { module_register(); /* lost the unregister race */ } else { /* nobody can enter the module now */ module_release_resources(); unlink_module_from_list(); wait_for_at_least_one_schedule_on_every_cpu(); free_module_storage(); } } wait_for_at_least_one_schedule_on_every_cpu() prevents the next operation until at least one schedule has been executed on every cpu. Whether this is done as a call back or a separate kernel thread that schedules itself on every cpu or the current process scheduling itself on every cpu is an implementation detail. All that matters is that any other cpu that might have been accessing the module has gone through schedule and therefore is no longer accessing the module's data or code. The beauty of this approach is that the rest of the cpus can do normal work. No need to bring everything to a dead stop.
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001, Keith Owens wrote: > On Mon, 19 Feb 2001 11:35:08 +0000 (GMT), > Alan Cox <[EMAIL PROTECTED]> wrote: > >The problem isnt running module code. What happens in this case > > > >mod->next = module_list; > >module_list = mod; /* link it in */ > > > >Note no write barrier. > > <humour>It works on ix86 so the code must be right</humour> Too bad it doesn't. > >Delete is even worse > > > >We unlink the module > >We free the memory > > > >At the same time another cpu may be walking the exception table that we free. > > Another good reason why locking modules via use counts from external > code is not the right fix. We definitely need a quiesce model for > module removal. Unless I'm mistaken, we need both use counts and SMP magic (though not necessarily as extreme as what the "freeze all other CPUs during module unload" patch did). I think something like this would work (in addition to use counts) int callin_func(void *p) { int *cpu = p; while (*cpu != smp_processor_id()) { current->cpus_allowed = 1 << *cpu; schedule(); } return 0; } void callin_other_cpus(void) { int cpus[smp_num_cpus]; int i; for (i=0; i < smp_num_cpus; i++) { cpus[i] = i; kernel_thread(callin_func, &cpus[i], ...); } } and call callin_other_cpus() before unloading a module. I'm not sure how you could make exception handling safe without locking all accesses to the module list - but that sounds like the sane thing to do anyway.
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001 11:35:08 +0000 (GMT), Alan Cox <[EMAIL PROTECTED]> wrote: >The problem isnt running module code. What happens in this case > >mod->next = module_list; >module_list = mod; /* link it in */ > >Note no write barrier. <humour>It works on ix86 so the code must be right</humour> >Delete is even worse > >We unlink the module >We free the memory > >At the same time another cpu may be walking the exception table that we free. Another good reason why locking modules via use counts from external code is not the right fix. We definitely need a quiesce model for module removal.
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001, Alan Cox wrote: > mod->next = module_list; put_mod_name() here. > module_list = mod; /* link it in */ > > Note no write barrier. put_mod_name calls free_page which always implies a memory barrier. This isn't beautiful but it won't blow up either. Actually that isn't relevant since the actual table is copied _after_ ex_table_{start,end}. We'll scan uninitialized memory and if some word happens to match the fault EIP we jump to the bogus fixup. > Delete is even worse > > We unlink the module > We free the memory > > At the same time another cpu may be walking the exception table that we free. True. Rusty had a patch that locked the module list properly IIRC.
Re: Linux 2.4.1-ac15
> The module list is modified atomically, so either we search the new table > or we don't, but we never see intermediate states. Not searching the new > table shouldn't be a problem as we shouldn't run module code until > sys_init_module time. The problem isnt running module code. What happens in this case mod->next = module_list; module_list = mod; /* link it in */ Note no write barrier. Delete is even worse We unlink the module We free the memory At the same time another cpu may be walking the exception table that we free.
Re: Linux 2.4.1-ac15
On Thu, 15 Feb 2001, Alan Cox wrote: > Question of the day for the VM folks: > If CPU1 is loading the exception tables for a module and > CPU2 faults.. what happens 8) "loading" == in sys_create_module ? The module list is modified atomically, so either we search the new table or we don't, but we never see intermediate states. Not searching the new table shouldn't be a problem as we shouldn't run module code until sys_init_module time.
Re: Linux 2.4.1-ac15
On Thu, 15 Feb 2001, Alan Cox wrote: Question of the day for the VM folks: If CPU1 is loading the exception tables for a module and CPU2 faults.. what happens 8) "loading" == in sys_create_module ? The module list is modified atomically, so either we search the new table or we don't, but we never see intermediate states. Not searching the new table shouldn't be a problem as we shouldn't run module code until sys_init_module time. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
The module list is modified atomically, so either we search the new table or we don't, but we never see intermediate states. Not searching the new table shouldn't be a problem as we shouldn't run module code until sys_init_module time. The problem isnt running module code. What happens in this case mod-next = module_list; module_list = mod; /* link it in */ Note no write barrier. Delete is even worse We unlink the module We free the memory At the same time another cpu may be walking the exception table that we free. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001, Alan Cox wrote: mod-next = module_list; put_mod_name() here. module_list = mod; /* link it in */ Note no write barrier. put_mod_name calls free_page which always implies a memory barrier. This isn't beautiful but it won't blow up either. Actually that isn't relevant since the actual table is copied _after_ ex_table_{start,end}. We'll scan uninitialized memory and if some word happens to match the fault EIP we jump to the bogus fixup. Delete is even worse We unlink the module We free the memory At the same time another cpu may be walking the exception table that we free. True. Rusty had a patch that locked the module list properly IIRC. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001 11:35:08 + (GMT), Alan Cox [EMAIL PROTECTED] wrote: The problem isnt running module code. What happens in this case mod-next = module_list; module_list = mod; /* link it in */ Note no write barrier. humourIt works on ix86 so the code must be right/humour Delete is even worse We unlink the module We free the memory At the same time another cpu may be walking the exception table that we free. Another good reason why locking modules via use counts from external code is not the right fix. We definitely need a quiesce model for module removal. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001, Keith Owens wrote: On Mon, 19 Feb 2001 11:35:08 + (GMT), Alan Cox [EMAIL PROTECTED] wrote: The problem isnt running module code. What happens in this case mod-next = module_list; module_list = mod; /* link it in */ Note no write barrier. humourIt works on ix86 so the code must be right/humour Too bad it doesn't. Delete is even worse We unlink the module We free the memory At the same time another cpu may be walking the exception table that we free. Another good reason why locking modules via use counts from external code is not the right fix. We definitely need a quiesce model for module removal. Unless I'm mistaken, we need both use counts and SMP magic (though not necessarily as extreme as what the "freeze all other CPUs during module unload" patch did). I think something like this would work (in addition to use counts) int callin_func(void *p) { int *cpu = p; while (*cpu != smp_processor_id()) { current-cpus_allowed = 1 *cpu; schedule(); } return 0; } void callin_other_cpus(void) { int cpus[smp_num_cpus]; int i; for (i=0; ismp_num_cpus; i++) { cpus[i] = i; kernel_thread(callin_func, cpus[i], ...); } } and call callin_other_cpus() before unloading a module. I'm not sure how you could make exception handling safe without locking all accesses to the module list - but that sounds like the sane thing to do anyway. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001 06:15:22 -0600 (CST), Philipp Rumpf [EMAIL PROTECTED] wrote: Unless I'm mistaken, we need both use counts and SMP magic (though not necessarily as extreme as what the "freeze all other CPUs during module unload" patch did). I think something like this would work (in addition to use counts) int callin_func(void *p) { int *cpu = p; while (*cpu != smp_processor_id()) { current-cpus_allowed = 1 *cpu; schedule(); } return 0; } void callin_other_cpus(void) { int cpus[smp_num_cpus]; int i; for (i=0; ismp_num_cpus; i++) { cpus[i] = i; kernel_thread(callin_func, cpus[i], ...); } } and call callin_other_cpus() before unloading a module. I'm not sure how you could make exception handling safe without locking all accesses to the module list - but that sounds like the sane thing to do anyway. No need for a callin routine, you can get this for free as part of normal scheduling. The sequence goes :- if (use_count == 0) { module_unregister(); wait_for_at_least_one_schedule_on_every_cpu(); if (use_count != 0) { module_register(); /* lost the unregister race */ } else { /* nobody can enter the module now */ module_release_resources(); unlink_module_from_list(); wait_for_at_least_one_schedule_on_every_cpu(); free_module_storage(); } } wait_for_at_least_one_schedule_on_every_cpu() prevents the next operation until at least one schedule has been executed on every cpu. Whether this is done as a call back or a separate kernel thread that schedules itself on every cpu or the current process scheduling itself on every cpu is an implementation detail. All that matters is that any other cpu that might have been accessing the module has gone through schedule and therefore is no longer accessing the module's data or code. The beauty of this approach is that the rest of the cpus can do normal work. No need to bring everything to a dead stop. 
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
In article [EMAIL PROTECTED] you wrote: No need for a callin routine, you can get this for free as part of normal scheduling. The sequence goes :- if (use_count == 0) { module_unregister(); wait_for_at_least_one_schedule_on_every_cpu(); if (use_count != 0) { module_register();/* lost the unregister race */ } else { /* nobody can enter the module now */ module_release_resources(); unlink_module_from_list(); wait_for_at_least_one_schedule_on_every_cpu(); free_module_storage(); } } wait_for_at_least_one_schedule_on_every_cpu() prevents the next operation until at least one schedule has been executed on every cpu. Whether this is done as a call back or a separate kernel thread that schedules itself on every cpu or the current process scheduling itself on every cpu is an implementation detail. All that matters is that any other cpu that might have been accessing the module has gone through schedule and therefore is no longer accessing the module's data or code. You just reinvented the read-copy-update model (http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html)... The mail proposing that locking model for module unloading is not yet in the arvhices, sorry. Christoph P.S. Weren't you Cc:'ed on that mail? -- Of course it doesn't work. We've performed a software upgrade. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
On Tue, 20 Feb 2001, Keith Owens wrote: On Mon, 19 Feb 2001 06:15:22 -0600 (CST), Philipp Rumpf [EMAIL PROTECTED] wrote: No need for a callin routine, you can get this for free as part of normal scheduling. The sequence goes :- if (use_count == 0) { module_unregister(); wait_for_at_least_one_schedule_on_every_cpu(); if (use_count != 0) { module_register();/* lost the unregister race */ } else { /* nobody can enter the module now */ module_release_resources(); unlink_module_from_list(); wait_for_at_least_one_schedule_on_every_cpu(); free_module_storage(); } } wait_for_at_least_one_schedule_on_every_cpu() prevents the next wait_for_at_least_one_schedule_on_every_cpu() *is* callin_other_cpus(). I agree the name isn't optimal. (and yes, you could implement it by hacking sched.c directly, but I don't think that's necessary). The beauty of this approach is that the rest of the cpus can do normal work. No need to bring everything to a dead stop. Which nicely avoids potential deadlocks in modules that need to initialize on all CPUs. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
Keith Owens wrote: wait_for_at_least_one_schedule_on_every_cpu(); what about spin_unlock_wait(global_exception_lock); The actual exception table waker uses spin_lock_irqsave(global_exception_lock); spin_unlock_irqsave(global_exception_lock); Or a simple spinlock - the code shouldn't be performance critical. -- Manfred - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
Rusty had a patch that locked the module list properly IIRC. So does -ac now. I just take a spinlock for the modify cases that race against faults (including vmalloc faults from irq context) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
At the same time another cpu may be walking the exception table that we free. Another good reason why locking modules via use counts from external code is not the right fix. We definitely need a quiesce model for module removal. My spinlock based fix has almost no contention and doesnt require 64 processors grind to a halt on a big machine just to handle a module list change. Sorry I don't think it supports your argument - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001, Alan Cox wrote: Rusty had a patch that locked the module list properly IIRC. So does -ac now. I just take a spinlock for the modify cases that race against faults (including vmalloc faults from irq context) so you hold a spinlock during copy_from_user ? Or did you change sys_init_module/sys_create_modules semantics ? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
so you hold a spinlock during copy_from_user ? Or did you change sys_init_module/sys_create_modules semantics ? The only points I need to hold a spinlock are editing the chain and walking it in a case where I dont hold the kernel lock. So I spin_lock_irqsave the two ops updating the list links in each case and the exception lookup walk - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001, Alan Cox wrote: so you hold a spinlock during copy_from_user ? Or did you change sys_init_module/sys_create_modules semantics ? The only points I need to hold a spinlock are editing the chain and walking it in a case where I dont hold the kernel lock. So I spin_lock_irqsave the two ops updating the list links in each case and the exception lookup walk So you fixed the nonexistent race only. The real race is that the module structure gets initialized first (so the exception table pointers point to uninitialized vmalloc'd (module_map'd) memory), then the module data (including the exception table) gets copied. The race window is from the first copy_from_user in sys_init_module until the second one. Or am I missing something ? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
So you fixed the nonexistent race only. The real race is that the module Umm I fixed the small race. You are right that there is a second race. uninitialized vmalloc'd (module_map'd) memory), then the module data (including the exception table) gets copied. The race window is from the first copy_from_user in sys_init_module until the second one. Yep. Obvious answer. Ignore exception tables for modules that are not MOD_RUNNING. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001, Alan Cox wrote: So you fixed the nonexistent race only. The real race is that the module Umm I fixed the small race. You are right that there is a second race. There's just one race. The small race is nonexistent since put_mod_name always acts as a memory barrier. uninitialized vmalloc'd (module_map'd) memory), then the module data (including the exception table) gets copied. The race window is from the first copy_from_user in sys_init_module until the second one. Yep. Obvious answer. Ignore exception tables for modules that are not MOD_RUNNING. You can have exceptions while initializing. not MOD_RUNNING|MOD_INITIALIZING should work though. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
You can have exceptions while initializing. not MOD_RUNNING|MOD_INITIALIZING should work though. True. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001 14:25:39 +0100, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> You just reinvented the read-copy-update model
> (http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html)...

The mail proposing that locking model for module unloading is not yet in
the archives, sorry. I did not reinvent it, I was following up on the
discussion we already had off list about using this model.
Re: Linux 2.4.1-ac15
On Mon, 19 Feb 2001 16:04:07 +0000 (GMT), Alan Cox <[EMAIL PROTECTED]> wrote:

> My spinlock based fix has almost no contention and doesnt require 64
> processors grind to a halt on a big machine just to handle a module list
> change. Sorry I don't think it supports your argument

I am not proposing that the other cpus grind to a halt. On the contrary,
they are allowed to continue running. It is the current process that is
suspended until all other cpus have gone through at least one schedule.

On Mon, 19 Feb 2001 16:23:09 +0100, Manfred Spraul <[EMAIL PROTECTED]> wrote:

> what about
>	spin_unlock_wait(&global_exception_lock);
> The actual exception table walker uses
>	spin_lock_irqsave(&global_exception_lock, flags);
>	spin_unlock_irqrestore(&global_exception_lock, flags);
> Or a simple spinlock - the code shouldn't be performance critical.

All lock based fixes have the disadvantage that you penalise the entire
kernel all the time. No matter how small the overhead of getting the lock,
it still exists, so we are slowing down the main kernel all the time to
handle the rare case of module unloading.

Also notice that we keep adding spinlocks: one for the main module list,
another for the exception tables. Then there is the architecture specific
module data, including unwind information for IA64; that also needs to be
locked. Requiring more than one spinlock to handle module unloading tells
me that the design is wrong.

Using wait_for_at_least_one_schedule_on_every_cpu() has no penalty except
on the process running rmmod. It does not require yet more spinlocks and
is architecture independent. Since schedule() updates
sched_data->last_schedule, all the rmmod process has to do is wait until
the last_schedule value changes on all cpus, forcing a reschedule if
necessary. Zero overhead in schedule, zero overhead in exception handling,
zero overhead in IA64 unwind code, architecture independent. Sounds good
to me.