Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output
John Baldwin wrote:
> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>> I've been trying to track down a deadlock on some newish production
>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
>> specific (although mundane) hardware configuration, and each of
>> several servers running this hardware deadlock about once per week.
>
> Try this change:
>
> <snip>
>
> We use it at work on 6.x. W/o this fix, round-robin stops working on
> 4BSD when softclock() (swi4: clock) blocks on a lock like Giant.

Just wanted to confirm: That patch did the trick. All the SMP machines
that had this problem have been stable for 11 days now, longer than any
of them were up previously.

I changed the patch slightly to work with FreeBSD 6.3-RELEASE. That's
attached, in case anyone needs this later.

Cheers,
Stef

--- sys/kern/sched_4bsd.c.orig 2006-06-16 22:11:55.0 +
+++ sys/kern/sched_4bsd.c 2008-06-18 17:04:34.0 +
@@ -157,13 +157,10 @@
 static int sched_quantum; /* Roundrobin scheduling quantum in ticks. */
 #define SCHED_QUANTUM (hz / 10) /* Default sched quantum */
 
-static struct callout roundrobin_callout;
-
 static void slot_fill(struct ksegrp *kg);
 static struct kse *sched_choose(void); /* XXX Should be thread * */
 
 static void setup_runqs(void);
-static void roundrobin(void *arg);
 static void schedcpu(void);
 static void schedcpu_thread(void);
 static void sched_priority(struct thread *td, u_char prio);
@@ -316,27 +313,6 @@
 }
 
 /*
- * Force switch among equal priority processes every 100ms.
- * We don't actually need to force a context switch of the current process.
- * The act of firing the event triggers a context switch to softclock() and
- * then switching back out again which is equivalent to a preemption, thus
- * no further work is needed on the local CPU.
- */
-/* ARGSUSED */
-static void
-roundrobin(void *arg)
-{
-
-#ifdef SMP
-	mtx_lock_spin(&sched_lock);
-	forward_roundrobin();
-	mtx_unlock_spin(&sched_lock);
-#endif
-
-	callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
-}
-
-/*
  * Constants for digital decay and forget:
  * 90% of (kg_estcpu) usage in 5 * loadav time
  * 95% of (ke_pctcpu) usage in 60 seconds (load insensitive)
@@ -618,11 +594,6 @@
 	sched_quantum = SCHED_QUANTUM;
 	hogticks = 2 * sched_quantum;
 
-	callout_init(&roundrobin_callout, CALLOUT_MPSAFE);
-
-	/* Kick off timeout driven events by calling first time. */
-	roundrobin(NULL);
-
 	/* Account for thread0. */
 	sched_load_add();
 }
@@ -697,6 +668,14 @@
 		resetpriority(kg);
 		resetpriority_thread(td, kg);
 	}
+
+	/*
+	 * Force a context switch if the current thread has used up a full
+	 * quantum (default quantum is 100ms).
+	 */
+	if (!((td)->td_flags & TDF_IDLETD) &&
+	    ticks - PCPU_GET(switchticks) >= sched_quantum)
+		td->td_flags |= TDF_NEEDRESCHED;
 }
 
 /*

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
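
For anyone reading the patch later, the check it adds to sched_clock() can be modelled in userland roughly as below. This is only a sketch under stated assumptions: the struct names, the flag values and the model_* helpers are made up for illustration and are not the kernel's definitions; only the condition itself mirrors the patch. With hz = 1000 (an assumed value), SCHED_QUANTUM (hz / 10) works out to 100 ticks, i.e. the 100ms mentioned in the comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel flags; the values are illustrative only. */
#define TDF_IDLETD      0x1
#define TDF_NEEDRESCHED 0x2

/* Hypothetical, minimal replacement for struct thread. */
struct model_thread {
	int td_flags;
};

/* Hypothetical replacement for the per-CPU data read via PCPU_GET(). */
struct model_cpu {
	int switchticks;	/* value of ticks at the last context switch */
};

/*
 * Model of the check the patch appends to sched_clock(): called once per
 * clock tick for the thread currently running on this CPU.
 */
static void
model_sched_clock(struct model_thread *td, const struct model_cpu *cpu,
    int ticks, int sched_quantum)
{
	assert(sched_quantum > 0);
	if (!(td->td_flags & TDF_IDLETD) &&
	    ticks - cpu->switchticks >= sched_quantum)
		td->td_flags |= TDF_NEEDRESCHED;
}

/*
 * Deliver nticks clock ticks to a thread that was switched in at tick
 * `start`, then report whether a reschedule was requested.
 */
static bool
model_run(struct model_thread *td, int start, int nticks, int sched_quantum)
{
	struct model_cpu cpu = { .switchticks = start };

	for (int i = 1; i <= nticks; i++)
		model_sched_clock(td, &cpu, start + i, sched_quantum);
	return (td->td_flags & TDF_NEEDRESCHED) != 0;
}
```

In this model a thread that has run for fewer ticks than the quantum is left alone, one that reaches the quantum is flagged, and a thread marked TDF_IDLETD is never flagged however long it runs.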
Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output
On Monday 23 June 2008 03:16:40 pm James Gritton wrote:
> John Baldwin wrote:
>> On Thursday 19 June 2008 11:57:51 am James Gritton wrote:
>>> John Baldwin wrote:
>>>> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>>>>> I've been trying to track down a deadlock on some newish production
>>>>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
>>>>> specific (although mundane) hardware configuration, and each of
>>>>> several servers running this hardware deadlock about once per week.
>>>>>
>>>>> Although I suspect that this is not hardware related, from a (naive)
>>>>> perusal of the attached stack traces. Forgive me if my interpretation
>>>>> of this is all wrong, but I'm pretty desperate for help.
>>>>>
>>>>> So here's my basic understanding of the deadlock:
>>>>>
>>>>> These processes seem to be waiting on the page queue mutex:
>>>>>
>>>>> sendmail (in vm_mmap vm_map_find vm_map_insert vm_map_pmap_enter)
>>>>> bsnmpd (in malloc, uma_large_malloc page_alloc kmem_malloc)
>>>>> httpd (in trap trap_pfault vm_fault)
>>>>> [g_up] (in g_vfs_done bufdone)
>>>>>
>>>>> The page queue mutex is held by rsync process:
>>>>>
>>>>> rsync (in trap trap_pfault vm_fault pmap_enter)
>>>>>
>>>>> Rsync kernel process (in pmap_enter) was interrupted while holding
>>>>> the page queue lock?
>>>>>
>>>>> Giant is enabled in loader.conf due to the needs of the pf firewall
>>>>> when dealing with user credentials lookups. I do not believe that
>>>>> Giant plays into this deadlock.
>>>>>
>>>>> Kernel config attached. Any and all help or info is welcome. Thanks
>>>>> in advance.
>>>>
>>>> Try this change:
>>>>
>>>> jhb         2007-10-27 22:07:40 UTC
>>>>
>>>>   FreeBSD src repository
>>>>
>>>>   Modified files:
>>>>     sys/kern             sched_4bsd.c
>>>>   Log:
>>>>   Change the roundrobin implementation in the 4BSD scheduler to
>>>>   trigger a userland preemption directly from hardclock() via
>>>>   sched_clock() when a thread uses up a full quantum instead of using
>>>>   a periodic timeout to cause a userland preemption every so often.
>>>>   This fixes a potential deadlock when IPI_PREEMPTION isn't enabled
>>>>   where softclock blocks on a lock held by a thread pinned or bound to
>>>>   another CPU. The current thread on that CPU will never be preempted
>>>>   while softclock is blocked. Note that ULE already drives its
>>>>   round-robin userland preemption from sched_clock() as well and
>>>>   always enables IPI_PREEMPT.
>>>>
>>>>   MFC after:      1 week
>>>>
>>>>   Revision  Changes    Path
>>>>   1.108     +8 -29     src/sys/kern/sched_4bsd.c
>>>>
>>>> We use it at work on 6.x. W/o this fix, round-robin stops working on
>>>> 4BSD when softclock() (swi4: clock) blocks on a lock like Giant.
>>>
>>> I've been seeing similar troubles on 6.2 and I'll have to give this a
>>> try as we upgrade to 6.3. I notice MFC after: 1 week in the log;
>>> it's been a week - any chance of seeing this fix rolled into 6.x?
>>
>> If people confirm it fixes issues I will MFC it. There was some
>> pushback when I first committed it so I waited on the MFC.
>
> I can confirm that on 6.3 I can recreate the deadlock without the
> patch, and can't recreate it with the patch.

Ok, I've merged it to RELENG_[67].

-- 
John Baldwin
Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output
On Thursday 19 June 2008 11:57:51 am James Gritton wrote:
> John Baldwin wrote:
>> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>>> I've been trying to track down a deadlock on some newish production
>>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
>>> specific (although mundane) hardware configuration, and each of
>>> several servers running this hardware deadlock about once per week.
>>>
>>> Although I suspect that this is not hardware related, from a (naive)
>>> perusal of the attached stack traces. Forgive me if my interpretation
>>> of this is all wrong, but I'm pretty desperate for help.
>>>
>>> So here's my basic understanding of the deadlock:
>>>
>>> These processes seem to be waiting on the page queue mutex:
>>>
>>> sendmail (in vm_mmap vm_map_find vm_map_insert vm_map_pmap_enter)
>>> bsnmpd (in malloc, uma_large_malloc page_alloc kmem_malloc)
>>> httpd (in trap trap_pfault vm_fault)
>>> [g_up] (in g_vfs_done bufdone)
>>>
>>> The page queue mutex is held by rsync process:
>>>
>>> rsync (in trap trap_pfault vm_fault pmap_enter)
>>>
>>> Rsync kernel process (in pmap_enter) was interrupted while holding
>>> the page queue lock?
>>>
>>> Giant is enabled in loader.conf due to the needs of the pf firewall
>>> when dealing with user credentials lookups. I do not believe that
>>> Giant plays into this deadlock.
>>>
>>> Kernel config attached. Any and all help or info is welcome. Thanks
>>> in advance.
>>
>> Try this change:
>>
>> jhb         2007-10-27 22:07:40 UTC
>>
>>   FreeBSD src repository
>>
>>   Modified files:
>>     sys/kern             sched_4bsd.c
>>   Log:
>>   Change the roundrobin implementation in the 4BSD scheduler to
>>   trigger a userland preemption directly from hardclock() via
>>   sched_clock() when a thread uses up a full quantum instead of using
>>   a periodic timeout to cause a userland preemption every so often.
>>   This fixes a potential deadlock when IPI_PREEMPTION isn't enabled
>>   where softclock blocks on a lock held by a thread pinned or bound to
>>   another CPU. The current thread on that CPU will never be preempted
>>   while softclock is blocked. Note that ULE already drives its
>>   round-robin userland preemption from sched_clock() as well and
>>   always enables IPI_PREEMPT.
>>
>>   MFC after:      1 week
>>
>>   Revision  Changes    Path
>>   1.108     +8 -29     src/sys/kern/sched_4bsd.c
>>
>> We use it at work on 6.x. W/o this fix, round-robin stops working on
>> 4BSD when softclock() (swi4: clock) blocks on a lock like Giant.
>
> I've been seeing similar troubles on 6.2 and I'll have to give this a
> try as we upgrade to 6.3. I notice MFC after: 1 week in the log;
> it's been a week - any chance of seeing this fix rolled into 6.x?

If people confirm it fixes issues I will MFC it. There was some
pushback when I first committed it so I waited on the MFC.

-- 
John Baldwin
Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output
John Baldwin wrote:
> On Thursday 19 June 2008 11:57:51 am James Gritton wrote:
>> John Baldwin wrote:
>>> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>>>> I've been trying to track down a deadlock on some newish production
>>>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
>>>> specific (although mundane) hardware configuration, and each of
>>>> several servers running this hardware deadlock about once per week.
>>>>
>>>> Although I suspect that this is not hardware related, from a (naive)
>>>> perusal of the attached stack traces. Forgive me if my interpretation
>>>> of this is all wrong, but I'm pretty desperate for help.
>>>>
>>>> So here's my basic understanding of the deadlock:
>>>>
>>>> These processes seem to be waiting on the page queue mutex:
>>>>
>>>> sendmail (in vm_mmap vm_map_find vm_map_insert vm_map_pmap_enter)
>>>> bsnmpd (in malloc, uma_large_malloc page_alloc kmem_malloc)
>>>> httpd (in trap trap_pfault vm_fault)
>>>> [g_up] (in g_vfs_done bufdone)
>>>>
>>>> The page queue mutex is held by rsync process:
>>>>
>>>> rsync (in trap trap_pfault vm_fault pmap_enter)
>>>>
>>>> Rsync kernel process (in pmap_enter) was interrupted while holding
>>>> the page queue lock?
>>>>
>>>> Giant is enabled in loader.conf due to the needs of the pf firewall
>>>> when dealing with user credentials lookups. I do not believe that
>>>> Giant plays into this deadlock.
>>>>
>>>> Kernel config attached. Any and all help or info is welcome. Thanks
>>>> in advance.
>>>
>>> Try this change:
>>>
>>> jhb         2007-10-27 22:07:40 UTC
>>>
>>>   FreeBSD src repository
>>>
>>>   Modified files:
>>>     sys/kern             sched_4bsd.c
>>>   Log:
>>>   Change the roundrobin implementation in the 4BSD scheduler to
>>>   trigger a userland preemption directly from hardclock() via
>>>   sched_clock() when a thread uses up a full quantum instead of using
>>>   a periodic timeout to cause a userland preemption every so often.
>>>   This fixes a potential deadlock when IPI_PREEMPTION isn't enabled
>>>   where softclock blocks on a lock held by a thread pinned or bound to
>>>   another CPU. The current thread on that CPU will never be preempted
>>>   while softclock is blocked. Note that ULE already drives its
>>>   round-robin userland preemption from sched_clock() as well and
>>>   always enables IPI_PREEMPT.
>>>
>>>   MFC after:      1 week
>>>
>>>   Revision  Changes    Path
>>>   1.108     +8 -29     src/sys/kern/sched_4bsd.c
>>>
>>> We use it at work on 6.x. W/o this fix, round-robin stops working on
>>> 4BSD when softclock() (swi4: clock) blocks on a lock like Giant.
>>
>> I've been seeing similar troubles on 6.2 and I'll have to give this a
>> try as we upgrade to 6.3. I notice MFC after: 1 week in the log;
>> it's been a week - any chance of seeing this fix rolled into 6.x?
>
> If people confirm it fixes issues I will MFC it. There was some
> pushback when I first committed it so I waited on the MFC.

I can confirm that on 6.3 I can recreate the deadlock without the
patch, and can't recreate it with the patch.
Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output
John Baldwin wrote:
> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>> I've been trying to track down a deadlock on some newish production
>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
>> specific (although mundane) hardware configuration, and each of
>> several servers running this hardware deadlock about once per week.
>>
>> Although I suspect that this is not hardware related, from a (naive)
>> perusal of the attached stack traces. Forgive me if my interpretation
>> of this is all wrong, but I'm pretty desperate for help.
>>
>> So here's my basic understanding of the deadlock:
>>
>> These processes seem to be waiting on the page queue mutex:
>>
>> sendmail (in vm_mmap vm_map_find vm_map_insert vm_map_pmap_enter)
>> bsnmpd (in malloc, uma_large_malloc page_alloc kmem_malloc)
>> httpd (in trap trap_pfault vm_fault)
>> [g_up] (in g_vfs_done bufdone)
>>
>> The page queue mutex is held by rsync process:
>>
>> rsync (in trap trap_pfault vm_fault pmap_enter)
>>
>> Rsync kernel process (in pmap_enter) was interrupted while holding
>> the page queue lock?
>>
>> Giant is enabled in loader.conf due to the needs of the pf firewall
>> when dealing with user credentials lookups. I do not believe that
>> Giant plays into this deadlock.
>>
>> Kernel config attached. Any and all help or info is welcome. Thanks
>> in advance.
>
> Try this change:
>
> jhb         2007-10-27 22:07:40 UTC
>
>   FreeBSD src repository
>
>   Modified files:
>     sys/kern             sched_4bsd.c
>   Log:
>   Change the roundrobin implementation in the 4BSD scheduler to
>   trigger a userland preemption directly from hardclock() via
>   sched_clock() when a thread uses up a full quantum instead of using
>   a periodic timeout to cause a userland preemption every so often.
>   This fixes a potential deadlock when IPI_PREEMPTION isn't enabled
>   where softclock blocks on a lock held by a thread pinned or bound to
>   another CPU. The current thread on that CPU will never be preempted
>   while softclock is blocked. Note that ULE already drives its
>   round-robin userland preemption from sched_clock() as well and
>   always enables IPI_PREEMPT.
>
>   MFC after:      1 week
>
>   Revision  Changes    Path
>   1.108     +8 -29     src/sys/kern/sched_4bsd.c
>
> We use it at work on 6.x. W/o this fix, round-robin stops working on
> 4BSD when softclock() (swi4: clock) blocks on a lock like Giant.

I've been seeing similar troubles on 6.2 and I'll have to give this a
try as we upgrade to 6.3. I notice MFC after: 1 week in the log;
it's been a week - any chance of seeing this fix rolled into 6.x?

- Jamie
Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output
On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
> I've been trying to track down a deadlock on some newish production
> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
> specific (although mundane) hardware configuration, and each of
> several servers running this hardware deadlock about once per week.
>
> Although I suspect that this is not hardware related, from a (naive)
> perusal of the attached stack traces. Forgive me if my interpretation
> of this is all wrong, but I'm pretty desperate for help.
>
> So here's my basic understanding of the deadlock:
>
> These processes seem to be waiting on the page queue mutex:
>
> sendmail (in vm_mmap vm_map_find vm_map_insert vm_map_pmap_enter)
> bsnmpd (in malloc, uma_large_malloc page_alloc kmem_malloc)
> httpd (in trap trap_pfault vm_fault)
> [g_up] (in g_vfs_done bufdone)
>
> The page queue mutex is held by rsync process:
>
> rsync (in trap trap_pfault vm_fault pmap_enter)
>
> Rsync kernel process (in pmap_enter) was interrupted while holding
> the page queue lock?
>
> Giant is enabled in loader.conf due to the needs of the pf firewall
> when dealing with user credentials lookups. I do not believe that
> Giant plays into this deadlock.
>
> Kernel config attached. Any and all help or info is welcome. Thanks
> in advance.

Try this change:

jhb         2007-10-27 22:07:40 UTC

  FreeBSD src repository

  Modified files:
    sys/kern             sched_4bsd.c
  Log:
  Change the roundrobin implementation in the 4BSD scheduler to
  trigger a userland preemption directly from hardclock() via
  sched_clock() when a thread uses up a full quantum instead of using
  a periodic timeout to cause a userland preemption every so often.
  This fixes a potential deadlock when IPI_PREEMPTION isn't enabled
  where softclock blocks on a lock held by a thread pinned or bound to
  another CPU. The current thread on that CPU will never be preempted
  while softclock is blocked. Note that ULE already drives its
  round-robin userland preemption from sched_clock() as well and
  always enables IPI_PREEMPT.

  MFC after:      1 week

  Revision  Changes    Path
  1.108     +8 -29     src/sys/kern/sched_4bsd.c

We use it at work on 6.x. W/o this fix, round-robin stops working on
4BSD when softclock() (swi4: clock) blocks on a lock like Giant.

-- 
John Baldwin
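
To illustrate why the commit log's change matters when softclock is blocked, here is a toy userland comparison of the two schemes. It is a deliberately simplified sketch, not kernel code: the function names and the softclock_blocked parameter are hypothetical, and the old scheme is reduced to "a periodic event that can only fire while softclock is able to run", which glosses over how the real roundrobin() callout and forward_roundrobin() behave.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: count how many preemption requests a CPU-bound thread
 * receives over nticks clock ticks.
 */

/*
 * Old scheme (simplified): a periodic timeout is due every `quantum`
 * ticks, but timeouts only run from softclock(). If softclock() is
 * blocked on a lock, the timeout never fires.
 */
static int
old_scheme_requests(int nticks, int quantum, bool softclock_blocked)
{
	int requests = 0;
	int next_fire;

	assert(quantum > 0);
	next_fire = quantum;
	for (int ticks = 1; ticks <= nticks; ticks++) {
		if (ticks < next_fire)
			continue;
		if (softclock_blocked)
			continue;	/* timeout is due but cannot run */
		requests++;
		next_fire = ticks + quantum;
	}
	return requests;
}

/*
 * New scheme (simplified): the quantum check runs in the clock tick
 * itself, so it does not depend on softclock() making progress.
 */
static int
new_scheme_requests(int nticks, int quantum, bool softclock_blocked)
{
	int requests = 0;
	int switchticks = 0;	/* tick of the last modelled context switch */

	assert(quantum > 0);
	(void)softclock_blocked;	/* deliberately has no effect here */
	for (int ticks = 1; ticks <= nticks; ticks++) {
		if (ticks - switchticks >= quantum) {
			requests++;
			switchticks = ticks;	/* model: thread is switched out */
		}
	}
	return requests;
}
```

In this model both schemes request a preemption once per quantum while softclock is healthy, but only the tick-driven one keeps doing so once softclock is blocked, which is the situation the commit log describes.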
Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output
John Baldwin wrote:
> Try this change:
>
> jhb         2007-10-27 22:07:40 UTC
>
> <snip>
>
> We use it at work on 6.x. W/o this fix, round-robin stops working on
> 4BSD when softclock() (swi4: clock) blocks on a lock like Giant.

Awesome. Thanks. That looks like it'll do the trick. I'll deploy it and
keep the list posted.

Stef