Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output

2008-06-30 Thread Stef
John Baldwin wrote:
> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>> I've been trying to track down a deadlock on some newish production
>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
>> specific (although mundane) hardware configuration, and each of several
>> servers running this hardware deadlock about once per week.
>
> Try this change:
>
<snip>
> We use it at work on 6.x.  W/o this fix, round-robin stops working on 4BSD
> when softclock() (swi4: clock) blocks on a lock like Giant.

Just wanted to confirm: That patch did the trick. All the SMP machines
that had this problem have been stable for 11 days now, longer than any
of them were up previously.

I changed the patch slightly to work with FreeBSD 6.3-RELEASE. That's
attached, in case anyone needs this later.

Cheers,
Stef
--- sys/kern/sched_4bsd.c.orig	2006-06-16 22:11:55.0 +
+++ sys/kern/sched_4bsd.c	2008-06-18 17:04:34.0 +
@@ -157,13 +157,10 @@
 static int	sched_quantum;	/* Roundrobin scheduling quantum in ticks. */
 #define	SCHED_QUANTUM	(hz / 10)	/* Default sched quantum */
 
-static struct callout roundrobin_callout;
-
 static void	slot_fill(struct ksegrp *kg);
 static struct kse *sched_choose(void);		/* XXX Should be thread * */
 
 static void	setup_runqs(void);
-static void	roundrobin(void *arg);
 static void	schedcpu(void);
 static void	schedcpu_thread(void);
 static void	sched_priority(struct thread *td, u_char prio);
@@ -316,27 +313,6 @@
 }
 
 /*
- * Force switch among equal priority processes every 100ms.
- * We don't actually need to force a context switch of the current process.
- * The act of firing the event triggers a context switch to softclock() and
- * then switching back out again which is equivalent to a preemption, thus
- * no further work is needed on the local CPU.
- */
-/* ARGSUSED */
-static void
-roundrobin(void *arg)
-{
-
-#ifdef SMP
-	mtx_lock_spin(&sched_lock);
-	forward_roundrobin();
-	mtx_unlock_spin(&sched_lock);
-#endif
-
-	callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
-}
-
-/*
  * Constants for digital decay and forget:
  *	90% of (kg_estcpu) usage in 5 * loadav time
  *	95% of (ke_pctcpu) usage in 60 seconds (load insensitive)
@@ -618,11 +594,6 @@
 		sched_quantum = SCHED_QUANTUM;
 	hogticks = 2 * sched_quantum;
 
-	callout_init(&roundrobin_callout, CALLOUT_MPSAFE);
-
-	/* Kick off timeout driven events by calling first time. */
-	roundrobin(NULL);
-
 	/* Account for thread0. */
 	sched_load_add();
 }
@@ -697,6 +668,14 @@
 		resetpriority(kg);
 		resetpriority_thread(td, kg);
 	}
+
+	/*
+	 * Force a context switch if the current thread has used up a full
+	 * quantum (default quantum is 100ms).
+	 */
+	if (!((td)->td_flags & TDF_IDLETD) &&
+	ticks - PCPU_GET(switchticks) >= sched_quantum)
+		td->td_flags |= TDF_NEEDRESCHED;
 }
 
 /*
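
For readers who want to see the added check in isolation, here is a minimal userland sketch of the same test. It is not kernel code: `toy_thread`, `toy_sched_clock`, `FLAG_IDLE` and `FLAG_NEEDRESCHED` are hypothetical stand-ins for the thread structure, sched_clock(), TDF_IDLETD and TDF_NEEDRESCHED, and the tick values are passed in as plain arguments rather than read from kernel state.

```c
/* Hypothetical stand-ins for kernel state; not the real FreeBSD structures. */
#define FLAG_IDLE        0x1	/* plays the role of TDF_IDLETD */
#define FLAG_NEEDRESCHED 0x2	/* plays the role of TDF_NEEDRESCHED */

struct toy_thread {
	int flags;
};

/*
 * Called once per clock tick for the running thread.  Marks a non-idle
 * thread for rescheduling once it has run for at least `quantum` ticks
 * since it was last switched in (`switchticks`).
 */
static void
toy_sched_clock(struct toy_thread *td, int ticks, int switchticks, int quantum)
{
	if (!(td->flags & FLAG_IDLE) && ticks - switchticks >= quantum)
		td->flags |= FLAG_NEEDRESCHED;
}
```

With a quantum of 10 ticks and a thread switched in at tick 1000, the flag stays clear through tick 1009 and is set at tick 1010. An idle thread is never flagged.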
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output

2008-06-24 Thread John Baldwin
On Monday 23 June 2008 03:16:40 pm James Gritton wrote:
> John Baldwin wrote:
>> On Thursday 19 June 2008 11:57:51 am James Gritton wrote:
>>> John Baldwin wrote:
>>>> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>>>>> I've been trying to track down a deadlock on some newish production
>>>>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
>>>>> specific (although mundane) hardware configuration, and each of several
>>>>> servers running this hardware deadlock about once per week.
>>>>>
>>>>> Although I suspect that this is not hardware related, from a (naive)
>>>>> perusal of the attached stack traces.
>>>>>
>>>>> Forgive me if my interpretation of this is all wrong, but I'm pretty
>>>>> desperate for help. So here's my basic understanding of the deadlock:
>>>>>
>>>>> These processes seem to be waiting on the page queue mutex:
>>>>>  sendmail (in vm_mmap > vm_map_find > vm_map_insert > vm_map_pmap_enter)
>>>>>  bsnmpd (in malloc, uma_large_malloc > page_alloc > kmem_malloc)
>>>>>  httpd (in trap > trap_pfault > vm_fault)
>>>>>  [g_up] (in g_vfs_done > bufdone)
>>>>>
>>>>> The page queue mutex is held by rsync process:
>>>>>  rsync (in trap > trap_pfault > vm_fault > pmap_enter)
>>>>>
>>>>> Rsync kernel process (in pmap_enter) was interrupted while holding the
>>>>> page queue lock?
>>>>>
>>>>> Giant is enabled in loader.conf due to the needs of the pf firewall when
>>>>> dealing with user credentials lookups. I do not believe that Giant plays
>>>>> into this deadlock. Kernel config attached.
>>>>>
>>>>> Any and all help or info is welcome. Thanks in advance.
>>>>
>>>> Try this change:
>>>>
>>>> jhb 2007-10-27 22:07:40 UTC
>>>>
>>>>   FreeBSD src repository
>>>>
>>>>   Modified files:
>>>>     sys/kern sched_4bsd.c
>>>>   Log:
>>>>   Change the roundrobin implementation in the 4BSD scheduler to trigger a
>>>>   userland preemption directly from hardclock() via sched_clock() when a
>>>>   thread uses up a full quantum instead of using a periodic timeout to cause
>>>>   a userland preemption every so often.  This fixes a potential deadlock
>>>>   when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held
>>>>   by a thread pinned or bound to another CPU.  The current thread on that
>>>>   CPU will never be preempted while softclock is blocked.
>>>>
>>>>   Note that ULE already drives its round-robin userland preemption from
>>>>   sched_clock() as well and always enables IPI_PREEMPT.
>>>>
>>>>   MFC after:  1 week
>>>>
>>>>   Revision  Changes    Path
>>>>   1.108     +8 -29     src/sys/kern/sched_4bsd.c
>>>>
>>>> We use it at work on 6.x.  W/o this fix, round-robin stops working on 4BSD
>>>> when softclock() (swi4: clock) blocks on a lock like Giant.
>>>
>>> I've been seeing similar troubles on 6.2 and I'll have to give this a
>>> try as we upgrade to 6.3.  I notice MFC after: 1 week in the log; it's
>>> been a week - any chance of seeing this fix rolled into 6.x?
>>
>> If people confirm it fixes issues I will MFC it.  There was some pushback when
>> I first committed it so I waited on the MFC.
>
> I can confirm that on 6.3 I can recreate the deadlock without the patch,
> and can't recreate it with the patch.

Ok, I've merged it to RELENG_[67].

-- 
John Baldwin


Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output

2008-06-23 Thread John Baldwin
On Thursday 19 June 2008 11:57:51 am James Gritton wrote:
> John Baldwin wrote:
>> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>>> I've been trying to track down a deadlock on some newish production
>>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
>>> specific (although mundane) hardware configuration, and each of several
>>> servers running this hardware deadlock about once per week.
>>>
>>> Although I suspect that this is not hardware related, from a (naive)
>>> perusal of the attached stack traces.
>>>
>>> Forgive me if my interpretation of this is all wrong, but I'm pretty
>>> desperate for help. So here's my basic understanding of the deadlock:
>>>
>>> These processes seem to be waiting on the page queue mutex:
>>>  sendmail (in vm_mmap > vm_map_find > vm_map_insert > vm_map_pmap_enter)
>>>  bsnmpd (in malloc, uma_large_malloc > page_alloc > kmem_malloc)
>>>  httpd (in trap > trap_pfault > vm_fault)
>>>  [g_up] (in g_vfs_done > bufdone)
>>>
>>> The page queue mutex is held by rsync process:
>>>  rsync (in trap > trap_pfault > vm_fault > pmap_enter)
>>>
>>> Rsync kernel process (in pmap_enter) was interrupted while holding the
>>> page queue lock?
>>>
>>> Giant is enabled in loader.conf due to the needs of the pf firewall when
>>> dealing with user credentials lookups. I do not believe that Giant plays
>>> into this deadlock. Kernel config attached.
>>>
>>> Any and all help or info is welcome. Thanks in advance.
>>
>> Try this change:
>>
>> jhb 2007-10-27 22:07:40 UTC
>>
>>   FreeBSD src repository
>>
>>   Modified files:
>>     sys/kern sched_4bsd.c
>>   Log:
>>   Change the roundrobin implementation in the 4BSD scheduler to trigger a
>>   userland preemption directly from hardclock() via sched_clock() when a
>>   thread uses up a full quantum instead of using a periodic timeout to cause
>>   a userland preemption every so often.  This fixes a potential deadlock
>>   when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held
>>   by a thread pinned or bound to another CPU.  The current thread on that
>>   CPU will never be preempted while softclock is blocked.
>>
>>   Note that ULE already drives its round-robin userland preemption from
>>   sched_clock() as well and always enables IPI_PREEMPT.
>>
>>   MFC after:  1 week
>>
>>   Revision  Changes    Path
>>   1.108     +8 -29     src/sys/kern/sched_4bsd.c
>>
>> We use it at work on 6.x.  W/o this fix, round-robin stops working on 4BSD
>> when softclock() (swi4: clock) blocks on a lock like Giant.
>
> I've been seeing similar troubles on 6.2 and I'll have to give this a
> try as we upgrade to 6.3.  I notice MFC after: 1 week in the log; it's
> been a week - any chance of seeing this fix rolled into 6.x?

If people confirm it fixes issues I will MFC it.  There was some pushback when 
I first committed it so I waited on the MFC.

-- 
John Baldwin


Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output

2008-06-23 Thread James Gritton

John Baldwin wrote:
> On Thursday 19 June 2008 11:57:51 am James Gritton wrote:
>> John Baldwin wrote:
>>> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>>>> I've been trying to track down a deadlock on some newish production
>>>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
>>>> specific (although mundane) hardware configuration, and each of several
>>>> servers running this hardware deadlock about once per week.
>>>>
>>>> Although I suspect that this is not hardware related, from a (naive)
>>>> perusal of the attached stack traces.
>>>>
>>>> Forgive me if my interpretation of this is all wrong, but I'm pretty
>>>> desperate for help. So here's my basic understanding of the deadlock:
>>>>
>>>> These processes seem to be waiting on the page queue mutex:
>>>>  sendmail (in vm_mmap > vm_map_find > vm_map_insert > vm_map_pmap_enter)
>>>>  bsnmpd (in malloc, uma_large_malloc > page_alloc > kmem_malloc)
>>>>  httpd (in trap > trap_pfault > vm_fault)
>>>>  [g_up] (in g_vfs_done > bufdone)
>>>>
>>>> The page queue mutex is held by rsync process:
>>>>  rsync (in trap > trap_pfault > vm_fault > pmap_enter)
>>>>
>>>> Rsync kernel process (in pmap_enter) was interrupted while holding the
>>>> page queue lock?
>>>>
>>>> Giant is enabled in loader.conf due to the needs of the pf firewall when
>>>> dealing with user credentials lookups. I do not believe that Giant plays
>>>> into this deadlock. Kernel config attached.
>>>>
>>>> Any and all help or info is welcome. Thanks in advance.
>>>
>>> Try this change:
>>>
>>> jhb 2007-10-27 22:07:40 UTC
>>>
>>>   FreeBSD src repository
>>>
>>>   Modified files:
>>>     sys/kern sched_4bsd.c
>>>   Log:
>>>   Change the roundrobin implementation in the 4BSD scheduler to trigger a
>>>   userland preemption directly from hardclock() via sched_clock() when a
>>>   thread uses up a full quantum instead of using a periodic timeout to cause
>>>   a userland preemption every so often.  This fixes a potential deadlock
>>>   when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held
>>>   by a thread pinned or bound to another CPU.  The current thread on that
>>>   CPU will never be preempted while softclock is blocked.
>>>
>>>   Note that ULE already drives its round-robin userland preemption from
>>>   sched_clock() as well and always enables IPI_PREEMPT.
>>>
>>>   MFC after:  1 week
>>>
>>>   Revision  Changes    Path
>>>   1.108     +8 -29     src/sys/kern/sched_4bsd.c
>>>
>>> We use it at work on 6.x.  W/o this fix, round-robin stops working on 4BSD
>>> when softclock() (swi4: clock) blocks on a lock like Giant.
>>
>> I've been seeing similar troubles on 6.2 and I'll have to give this a
>> try as we upgrade to 6.3.  I notice MFC after: 1 week in the log; it's
>> been a week - any chance of seeing this fix rolled into 6.x?
>
> If people confirm it fixes issues I will MFC it.  There was some pushback when
> I first committed it so I waited on the MFC.

I can confirm that on 6.3 I can recreate the deadlock without the patch, 
and can't recreate it with the patch.



Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output

2008-06-19 Thread James Gritton

John Baldwin wrote:
> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>> I've been trying to track down a deadlock on some newish production
>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
>> specific (although mundane) hardware configuration, and each of several
>> servers running this hardware deadlock about once per week.
>>
>> Although I suspect that this is not hardware related, from a (naive)
>> perusal of the attached stack traces.
>>
>> Forgive me if my interpretation of this is all wrong, but I'm pretty
>> desperate for help. So here's my basic understanding of the deadlock:
>>
>> These processes seem to be waiting on the page queue mutex:
>>  sendmail (in vm_mmap > vm_map_find > vm_map_insert > vm_map_pmap_enter)
>>  bsnmpd (in malloc, uma_large_malloc > page_alloc > kmem_malloc)
>>  httpd (in trap > trap_pfault > vm_fault)
>>  [g_up] (in g_vfs_done > bufdone)
>>
>> The page queue mutex is held by rsync process:
>>  rsync (in trap > trap_pfault > vm_fault > pmap_enter)
>>
>> Rsync kernel process (in pmap_enter) was interrupted while holding the
>> page queue lock?
>>
>> Giant is enabled in loader.conf due to the needs of the pf firewall when
>> dealing with user credentials lookups. I do not believe that Giant plays
>> into this deadlock. Kernel config attached.
>>
>> Any and all help or info is welcome. Thanks in advance.
>
> Try this change:
>
> jhb 2007-10-27 22:07:40 UTC
>
>   FreeBSD src repository
>
>   Modified files:
>     sys/kern sched_4bsd.c
>   Log:
>   Change the roundrobin implementation in the 4BSD scheduler to trigger a
>   userland preemption directly from hardclock() via sched_clock() when a
>   thread uses up a full quantum instead of using a periodic timeout to cause
>   a userland preemption every so often.  This fixes a potential deadlock
>   when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held
>   by a thread pinned or bound to another CPU.  The current thread on that
>   CPU will never be preempted while softclock is blocked.
>
>   Note that ULE already drives its round-robin userland preemption from
>   sched_clock() as well and always enables IPI_PREEMPT.
>
>   MFC after:  1 week
>
>   Revision  Changes    Path
>   1.108     +8 -29     src/sys/kern/sched_4bsd.c
>
> We use it at work on 6.x.  W/o this fix, round-robin stops working on 4BSD
> when softclock() (swi4: clock) blocks on a lock like Giant.


I've been seeing similar troubles on 6.2 and I'll have to give this a 
try as we upgrade to 6.3.  I notice MFC after: 1 week in the log; it's 
been a week - any chance of seeing this fix rolled into 6.x?


- Jamie


Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output

2008-06-18 Thread John Baldwin
On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
> I've been trying to track down a deadlock on some newish production
> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
> specific (although mundane) hardware configuration, and each of several
> servers running this hardware deadlock about once per week.
>
> Although I suspect that this is not hardware related, from a (naive)
> perusal of the attached stack traces.
>
> Forgive me if my interpretation of this is all wrong, but I'm pretty
> desperate for help. So here's my basic understanding of the deadlock:
>
> These processes seem to be waiting on the page queue mutex:
>  sendmail (in vm_mmap > vm_map_find > vm_map_insert > vm_map_pmap_enter)
>  bsnmpd (in malloc, uma_large_malloc > page_alloc > kmem_malloc)
>  httpd (in trap > trap_pfault > vm_fault)
>  [g_up] (in g_vfs_done > bufdone)
>
> The page queue mutex is held by rsync process:
>  rsync (in trap > trap_pfault > vm_fault > pmap_enter)
>
> Rsync kernel process (in pmap_enter) was interrupted while holding the
> page queue lock?
>
> Giant is enabled in loader.conf due to the needs of the pf firewall when
> dealing with user credentials lookups. I do not believe that Giant plays
> into this deadlock. Kernel config attached.
>
> Any and all help or info is welcome. Thanks in advance.

Try this change:

jhb 2007-10-27 22:07:40 UTC

  FreeBSD src repository

  Modified files:
sys/kern sched_4bsd.c
  Log:
  Change the roundrobin implementation in the 4BSD scheduler to trigger a
  userland preemption directly from hardclock() via sched_clock() when a
  thread uses up a full quantum instead of using a periodic timeout to cause
  a userland preemption every so often.  This fixes a potential deadlock
  when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held
  by a thread pinned or bound to another CPU.  The current thread on that
  CPU will never be preempted while softclock is blocked.

  Note that ULE already drives its round-robin userland preemption from
  sched_clock() as well and always enables IPI_PREEMPT.

  MFC after:  1 week

  Revision  Changes    Path
  1.108     +8 -29     src/sys/kern/sched_4bsd.c

We use it at work on 6.x.  W/o this fix, round-robin stops working on 4BSD 
when softclock() (swi4: clock) blocks on a lock like Giant.
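
The difference between the two schemes described in the commit log can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not scheduler code: `preemptions_via_callout` and `preemptions_via_tick` are hypothetical names, a "blocked softclock" is modelled as a boolean, and each function just counts how many preemption requests a CPU-bound thread would receive over a number of clock ticks.

```c
#include <stdbool.h>

/*
 * Toy model, not kernel code.  Counts how many preemption requests a
 * CPU-bound thread receives over `nticks` clock ticks.
 */

/* Old scheme: requests come from a periodic timeout run by softclock. */
static int
preemptions_via_callout(int nticks, int quantum, bool softclock_blocked)
{
	int count = 0;

	for (int t = 1; t <= nticks; t++) {
		/* A blocked softclock never gets to run its timeouts. */
		if (!softclock_blocked && t % quantum == 0)
			count++;
	}
	return count;
}

/* New scheme: the per-tick handler itself checks the elapsed quantum. */
static int
preemptions_via_tick(int nticks, int quantum, bool softclock_blocked)
{
	int count = 0;
	int switchticks = 0;

	(void)softclock_blocked;	/* the tick path does not depend on it */
	for (int t = 1; t <= nticks; t++) {
		if (t - switchticks >= quantum) {
			count++;
			switchticks = t;	/* model the resulting switch */
		}
	}
	return count;
}
```

In this model, with 100 ticks and a quantum of 10, both schemes produce 10 requests while softclock is running. Once softclock is blocked the callout-driven count drops to 0 while the tick-driven count stays at 10, which is the property the fix relies on.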

-- 
John Baldwin


Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output

2008-06-18 Thread Stef
John Baldwin wrote:
> Try this change:
>
> jhb 2007-10-27 22:07:40 UTC

<snip>

> We use it at work on 6.x.  W/o this fix, round-robin stops working on 4BSD
> when softclock() (swi4: clock) blocks on a lock like Giant.

Awesome. Thanks. That looks like it'll do the trick. I'll deploy it and
keep the list posted.

Stef
