Re: r244036 kernel hangs under load.

2012-12-13 Thread Rick Macklem
Konstantin Belousov wrote:
 On Wed, Dec 12, 2012 at 10:01:39PM -0500, Rick Macklem wrote:
  Konstantin Belousov wrote:
   On Tue, Dec 11, 2012 at 08:58:47PM -0500, Rick Macklem wrote:
Ok, I'll test r243598 and then r243599 and r243835, just to
see if it really is this.
   
I'll email when I have done this.
   If you test only r243598, I am sure that you would experience
   corruption.
   The r243599 should cause the deadlocks.
  
  I think you meant r243599 will result in corruptions and
  r243835 deadlocks.
 
  I have run r243598 for a while without a hang. (r243599 doesn't
  build) I haven't tried r243835 yet.
 
   
 
   Also, do you use the post-r244095 kernel ?
 
  Before and after. The most recent tests were post-r244095.
  (If anything the more recent kernels hang more easily.)
 
 
  
   Is your machine SMP ?
 
  Old, slow single core i386.

 Try this. Please note that this is mostly a debugging
 facility.

It seemed to help, but didn't stop the hangs completely.
r244125 without the patch would hang somewhere in a kernel
build. r244125 plus this patch ran almost 2 kernel builds
before it got hung.
  
   Can you try to look into the system state on the hang (on the
   kernel
   with the if (1 || patch applied) ? Using the ddb and recipe from
   the
   web page. Basically watch for a thread looping in the mnt_active
   iterator and threads owning vnode interlocks.
 
  Ok, there is only one process in the mnt_active iterator and its
  trace is as follows (syncer):
  dle+0x12d/frame 0xdfe33adc (I suspect the screen lost an 'I')
  intr_execute_handlers(c5e3d064,dfe33b20,0,dfe33b64,c0ec2115,...) at
  intr_execute_handlers+0x49/frame 0xdfe33afc
  lapic_handle_intr(3c,dfe33b20) at lapic_handle_intr+0x36/frame
  0xdfe33b10
  Xapic_isr1() at Xapic_isr1+0x35/frame 0xdfe33b10
  --- interrupt, eip = 0xc0eca8db, esp = 0xdfe33b60, ebp = 0xdfe33b64
  ---
  spinlock_exit(c128be90,4,c10b5017,130,1710,...) at
  spinlock_exit+0x2b/frame 0xdfe33b64
  __mtx_unlock_spin_flags(c128be90,0,c10b80be,25d,0,...) at
  __mtx_unlock_spin_flags+0x112/frame 0xdfe33b90
  kern_yield(,0,c10c75c9,127b,c8b05238,...) at
  kern_yield+0x125/frame 0xdfe33bbc
  __mnt_vnode_next_active(dfe33c08,c857ba80,c10c75c9,dac,5d7,...) at
  __mnt_vnode_next_active+0xda/frame 0xdfe33be0
  vfs_msync(c857ba80,2,2,e6b,c857ba80,...) at vfs_msync+0x175/frame
  0xdfe33c18
  sync_fsync(dfe33ca8,c85cf470,80400,c10c75c9,6f4,...) at
  sync_fsync+0x141/frame 0xdfe33c64
  VOP_FSYNC_APV(c12008a0,dfe33ca8,c10c75c9,6f4,4e20,...) at
  VOP_FSYNC_APV+0xb4/frame 0xdfe33c64
  sched_sync(0,dfe33d08,c10b0e10,3db,c85395a0,...) at
  sched_sync+0x399/frame 0xdfe33cc8
  fork_exit(c0b79420,0,dfe33d08) at fork_exit+0xc0/frame 0xdfe33cf4
  fork_trampoline() at fork_trampoline+0x8/frame 0xdfe33cf4
  --- trap 0, eip = 0, esp = 0xdfe33d40, ebp = 0 ---
 
  This process holds:
  exclusive lockmgr syncer (syncer) r = 0 (0xc85cf4c8) locked @
  kern/vfs_subr.c:1780
 
  The only other process that is doing anything in the VFS subsystem
  holds the vnode interlock. Its trace is:
  dle+0x12d/frame 0xdfe6a850
  intr_execute_handlers(c5f721c0,dfe6a894,0,dfe6a908,c0ec2115,...) at
  intr_execute_handlers+0x49/frame 0xdfe6a870
  lapic_handle_intr(31,dfe6a894) at lapic_handle_intr+0x36/frame
  0xdfe6a884
  Xapic_isr1() at Xapic_isr1+0x35/frame 0xdfe6a884
  --- interrupt, eip = 0xc0b2206a, esp = 0xdfe6a8d4, ebp = 0xdfe6a908
  ---
  witness_unlock(c8972a74,8,c10c75c9,965,0,...) at
  witness_unlock+0x3a/frame 0xdfe6a908
  __mtx_unlock_flags(c8972a84,0,c10c75c9,965,c89729fc,...) at
  __mtx_unlock_flags+0x9f/frame 0xdfe6a938
  vdropl(c89729fc,dfe6a974,c10c75c9,8e7,c1238020,...) at
  vdropl+0x63/frame 0xdfe6a95c
  vputx(dfe6aa04,c0b67acc,c89729fc,dfe6a9e4,dfe6abc4,...) at
  vputx+0x300/frame 0xdfe6a994
  vput(c89729fc,dfe6a9e4,dfe6abc4,31d,dfe6a9e4,...) at vput+0x10/frame
  0xdfe6a99c
  lookup(dfe6ab84,c857e000,0,ce,c13c83c8,...) at lookup+0x9bc/frame
  0xdfe6aa04
  namei(dfe6ab84,0,0,246,0,...) at namei+0x4fe/frame 0xdfe6aa80
  vn_open_cred(dfe6ab84,dfe6ac24,1a4,0,c5dd4580,...) at
  vn_open_cred+0x2c0/frame 0xdfe6ab40
  vn_open(dfe6ab84,dfe6ac24,1a4,c85922a0,c853a2d0,...) at
  vn_open+0x3b/frame 0xdfe6ab60
  kern_openat(c85c55e0,ff9c,2882dcc0,0,8001,...) at
  kern_openat+0x1e2/frame 0xdfe6ac0c
  kern_open(c85c55e0,2882dcc0,0,8000,1b6,...) at kern_open+0x35/frame
  0xdfe6ac2c
  sys_open(c85c55e0,dfe6accc,c02acde7,7307f55d,5e5b00,...) at
  sys_open+0x30/frame 0xdfe6ac48
  syscall(dfe6ad08) at syscall+0x2e5/frame 0xdfe6acfc
  Xint0x80_syscall() at Xint0x80_syscall+0x21/frame 0xdfe6acfc
  --- syscall (5, FreeBSD ELF32, sys_open), eip = 0x84a1667, esp =
  0xbfbfcffc, ebp = 0xbfbfd018 ---
 
  The locks this process holds are:
  exclusive sleep mutex vnode interlock (vnode interlock) r = 0
  (0x8972a74) locked @ kern/vfs_subr.c:2513
  shared lockmgr ufs (ufs) r = 0 (0xc8bd181c) locked @
  kern/vfs_subr.c:2161
 
  

Re: r244036 kernel hangs under load.

2012-12-13 Thread Konstantin Belousov
On Thu, Dec 13, 2012 at 10:14:29PM -0500, Rick Macklem wrote:
 Good work. This patch seems to have done the trick. I've run
 quite a few kernel build cycles without a hang. I'll keep running
 them, but I would have expected to see a hang by now.
 
 Maybe Tim can test the patch as well?
 (I needed to add #include <sys/smp.h> btw.)

Below is the current version of the patch; it is being tested by Peter Holm.
Compared to what I sent you earlier, it contains a bug fix that is only
relevant if you use quotas.

diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 67e078d..64d75fb 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -69,6 +69,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/reboot.h>
 #include <sys/sched.h>
 #include <sys/sleepqueue.h>
+#include <sys/smp.h>
 #include <sys/stat.h>
 #include <sys/sysctl.h>
 #include <sys/syslog.h>
@@ -4710,32 +4711,54 @@ __mnt_vnode_markerfree_all(struct vnode **mvp, struct mount *mp)
  * These are helper functions for filesystems to traverse their
  * active vnodes.  See MNT_VNODE_FOREACH_ACTIVE() in sys/mount.h
  */
-struct vnode *
-__mnt_vnode_next_active(struct vnode **mvp, struct mount *mp)
+static void
+mnt_vnode_markerfree_active(struct vnode **mvp, struct mount *mp)
+{
+
+	KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
+
+	MNT_ILOCK(mp);
+	MNT_REL(mp);
+	MNT_IUNLOCK(mp);
+	free(*mvp, M_VNODE_MARKER);
+	*mvp = NULL;
+}
+
+#ifdef SMP
+#define	ALWAYS_YIELD	(mp_ncpus == 1)
+#else
+#define	ALWAYS_YIELD	1
+#endif
+
+static struct vnode *
+mnt_vnode_next_active(struct vnode **mvp, struct mount *mp)
 {
 	struct vnode *vp, *nvp;
 
-	if (should_yield())
-		kern_yield(PRI_UNCHANGED);
-	mtx_lock(&vnode_free_list_mtx);
-restart:
+	mtx_assert(&vnode_free_list_mtx, MA_OWNED);
 	KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
+restart:
 	vp = TAILQ_NEXT(*mvp, v_actfreelist);
+	TAILQ_REMOVE(&mp->mnt_activevnodelist, *mvp, v_actfreelist);
 	while (vp != NULL) {
 		if (vp->v_type == VMARKER) {
 			vp = TAILQ_NEXT(vp, v_actfreelist);
 			continue;
 		}
 		if (!VI_TRYLOCK(vp)) {
-			if (should_yield()) {
+			if (ALWAYS_YIELD || should_yield()) {
+				TAILQ_INSERT_BEFORE(vp, *mvp, v_actfreelist);
 				mtx_unlock(&vnode_free_list_mtx);
-				kern_yield(PRI_UNCHANGED);
+				kern_yield(PRI_USER);
 				mtx_lock(&vnode_free_list_mtx);
+				goto restart;
 			}
-			goto restart;
+			continue;
 		}
-		if (vp->v_mount == mp && vp->v_type != VMARKER &&
-		    (vp->v_iflag & VI_DOOMED) == 0)
+		KASSERT(vp->v_type != VMARKER, ("locked marker %p", vp));
+		KASSERT(vp->v_mount == mp || vp->v_mount == NULL,
+		    ("alien vnode on the active list %p %p", vp, mp));
+		if (vp->v_mount == mp && (vp->v_iflag & VI_DOOMED) == 0)
 			break;
 		nvp = TAILQ_NEXT(vp, v_actfreelist);
 		VI_UNLOCK(vp);
@@ -4745,22 +4768,31 @@ restart:
 	/* Check if we are done */
 	if (vp == NULL) {
 		mtx_unlock(&vnode_free_list_mtx);
-		__mnt_vnode_markerfree_active(mvp, mp);
-		mtx_assert(MNT_MTX(mp), MA_NOTOWNED);
+		mnt_vnode_markerfree_active(mvp, mp);
 		return (NULL);
 	}
-	TAILQ_REMOVE(&mp->mnt_activevnodelist, *mvp, v_actfreelist);
 	TAILQ_INSERT_AFTER(&mp->mnt_activevnodelist, vp, *mvp, v_actfreelist);
 	mtx_unlock(&vnode_free_list_mtx);
 	ASSERT_VI_LOCKED(vp, "active iter");
 	KASSERT((vp->v_iflag & VI_ACTIVE) != 0, ("Non-active vp %p", vp));
 	return (vp);
 }
+#undef ALWAYS_YIELD
+
+struct vnode *
+__mnt_vnode_next_active(struct vnode **mvp, struct mount *mp)
+{
+
+	if (should_yield())
+		kern_yield(PRI_UNCHANGED);
+	mtx_lock(&vnode_free_list_mtx);
+	return (mnt_vnode_next_active(mvp, mp));
+}
 
 struct vnode *
 __mnt_vnode_first_active(struct vnode **mvp, struct mount *mp)
 {
-	struct vnode *vp, *nvp;
+	struct vnode *vp;
 
 	*mvp = malloc(sizeof(struct vnode), M_VNODE_MARKER, M_WAITOK | M_ZERO);
 	MNT_ILOCK(mp);
@@ -4770,44 +4802,14 @@ __mnt_vnode_first_active(struct vnode **mvp, struct mount *mp)
 	(*mvp)->v_mount = mp;
 
 	mtx_lock(&vnode_free_list_mtx);
-restart:
 	vp = TAILQ_FIRST(&mp->mnt_activevnodelist);
-	while (vp != NULL) {
-		if (vp->v_type == VMARKER) {
-			vp = TAILQ_NEXT(vp, v_actfreelist);
-			continue;
-		}
-		if (!VI_TRYLOCK(vp)) {
-
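
For context, a minimal sketch (not part of the patch) of the consumer side of
this iterator, assuming the MNT_VNODE_FOREACH_ACTIVE() macro from sys/mount.h
wraps __mnt_vnode_first_active()/__mnt_vnode_next_active() as in head at the
time; the change above only affects what happens inside the iterator when
VI_TRYLOCK() fails, loops like this one are unchanged:

/*
 * Illustrative sketch only, not committed code.  Assumes the usual
 * kernel headers (sys/param.h, sys/mount.h, sys/vnode.h).  The
 * iterator hands back each active vnode with its interlock held, so
 * the loop body must drop it (or pass LK_INTERLOCK to vget()) before
 * the next step.
 */
static void
example_scan_active(struct mount *mp)
{
	struct vnode *vp, *mvp;

	MNT_VNODE_FOREACH_ACTIVE(vp, mp, mvp) {
		/* vp was locked with VI_TRYLOCK() inside the iterator. */
		if (vp->v_object == NULL) {
			VI_UNLOCK(vp);
			continue;
		}
		/* ... examine or flush the vnode here ... */
		VI_UNLOCK(vp);
	}
}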

Re: r244036 kernel hangs under load.

2012-12-12 Thread Rick Macklem
Konstantin Belousov wrote:
 On Tue, Dec 11, 2012 at 08:58:47PM -0500, Rick Macklem wrote:
  Ok, I'll test r243598 and then r243599 and r243835, just to
  see if it really is this.
 
  I'll email when I have done this.
 If you test only r243598, I am sure that you would experience
 corruption.
 The r243599 should cause the deadlocks.
 
I think you meant r243599 will result in corruptions and
r243835 deadlocks.

I have run r243598 for a while without a hang. (r243599 doesn't
build) I haven't tried r243835 yet.

 
   
 Also, do you use the post-r244095 kernel ?
   
Before and after. The most recent tests were post-r244095.
(If anything the more recent kernels hang more easily.)
   
   

 Is your machine SMP ?
   
Old, slow single core i386.
  
   Try this. Please note that this is mostly a debugging facility.
  
  It seemed to help, but didn't stop the hangs completely.
  r244125 without the patch would hang somewhere in a kernel
  build. r244125 plus this patch ran almost 2 kernel builds
  before it got hung.
 
 Can you try to look into the system state on the hang (on the kernel
 with the if (1 || patch applied) ? Using the ddb and recipe from the
 web page. Basically watch for a thread looping in the mnt_active
 iterator and threads owning vnode interlocks.

Ok, there is only one process in the mnt_active iterator and its
trace is as follows (syncer):
dle+0x12d/frame 0xdfe33adc (I suspect the screen lost an 'I')
intr_execute_handlers(c5e3d064,dfe33b20,0,dfe33b64,c0ec2115,...) at 
intr_execute_handlers+0x49/frame 0xdfe33afc
lapic_handle_intr(3c,dfe33b20) at lapic_handle_intr+0x36/frame 0xdfe33b10
Xapic_isr1() at Xapic_isr1+0x35/frame 0xdfe33b10
--- interrupt, eip = 0xc0eca8db, esp = 0xdfe33b60, ebp = 0xdfe33b64 ---
spinlock_exit(c128be90,4,c10b5017,130,1710,...) at spinlock_exit+0x2b/frame 
0xdfe33b64
__mtx_unlock_spin_flags(c128be90,0,c10b80be,25d,0,...) at 
__mtx_unlock_spin_flags+0x112/frame 0xdfe33b90
kern_yield(,0,c10c75c9,127b,c8b05238,...) at kern_yield+0x125/frame 
0xdfe33bbc
__mnt_vnode_next_active(dfe33c08,c857ba80,c10c75c9,dac,5d7,...) at 
__mnt_vnode_next_active+0xda/frame 0xdfe33be0
vfs_msync(c857ba80,2,2,e6b,c857ba80,...) at vfs_msync+0x175/frame 0xdfe33c18
sync_fsync(dfe33ca8,c85cf470,80400,c10c75c9,6f4,...) at sync_fsync+0x141/frame 
0xdfe33c64
VOP_FSYNC_APV(c12008a0,dfe33ca8,c10c75c9,6f4,4e20,...) at 
VOP_FSYNC_APV+0xb4/frame 0xdfe33c64
sched_sync(0,dfe33d08,c10b0e10,3db,c85395a0,...) at sched_sync+0x399/frame 
0xdfe33cc8
fork_exit(c0b79420,0,dfe33d08) at fork_exit+0xc0/frame 0xdfe33cf4
fork_trampoline() at fork_trampoline+0x8/frame 0xdfe33cf4
--- trap 0, eip = 0, esp = 0xdfe33d40, ebp = 0 ---

This process holds:
exclusive lockmgr syncer (syncer) r = 0 (0xc85cf4c8) locked @ 
kern/vfs_subr.c:1780

The only other process that is doing anything in the VFS subsystem
holds the vnode interlock. Its trace is:
dle+0x12d/frame 0xdfe6a850
intr_execute_handlers(c5f721c0,dfe6a894,0,dfe6a908,c0ec2115,...) at 
intr_execute_handlers+0x49/frame 0xdfe6a870
lapic_handle_intr(31,dfe6a894) at lapic_handle_intr+0x36/frame 0xdfe6a884
Xapic_isr1() at Xapic_isr1+0x35/frame 0xdfe6a884
--- interrupt, eip = 0xc0b2206a, esp = 0xdfe6a8d4, ebp = 0xdfe6a908 ---
witness_unlock(c8972a74,8,c10c75c9,965,0,...) at witness_unlock+0x3a/frame 
0xdfe6a908
__mtx_unlock_flags(c8972a84,0,c10c75c9,965,c89729fc,...) at 
__mtx_unlock_flags+0x9f/frame 0xdfe6a938
vdropl(c89729fc,dfe6a974,c10c75c9,8e7,c1238020,...) at vdropl+0x63/frame 
0xdfe6a95c
vputx(dfe6aa04,c0b67acc,c89729fc,dfe6a9e4,dfe6abc4,...) at vputx+0x300/frame 
0xdfe6a994
vput(c89729fc,dfe6a9e4,dfe6abc4,31d,dfe6a9e4,...) at vput+0x10/frame 0xdfe6a99c
lookup(dfe6ab84,c857e000,0,ce,c13c83c8,...) at lookup+0x9bc/frame 0xdfe6aa04
namei(dfe6ab84,0,0,246,0,...) at namei+0x4fe/frame 0xdfe6aa80
vn_open_cred(dfe6ab84,dfe6ac24,1a4,0,c5dd4580,...) at vn_open_cred+0x2c0/frame 
0xdfe6ab40
vn_open(dfe6ab84,dfe6ac24,1a4,c85922a0,c853a2d0,...) at vn_open+0x3b/frame 
0xdfe6ab60
kern_openat(c85c55e0,ff9c,2882dcc0,0,8001,...) at kern_openat+0x1e2/frame 
0xdfe6ac0c
kern_open(c85c55e0,2882dcc0,0,8000,1b6,...) at kern_open+0x35/frame 0xdfe6ac2c
sys_open(c85c55e0,dfe6accc,c02acde7,7307f55d,5e5b00,...) at sys_open+0x30/frame 
0xdfe6ac48
syscall(dfe6ad08) at syscall+0x2e5/frame 0xdfe6acfc
Xint0x80_syscall() at Xint0x80_syscall+0x21/frame 0xdfe6acfc
--- syscall (5, FreeBSD ELF32, sys_open), eip = 0x84a1667, esp = 0xbfbfcffc, 
ebp = 0xbfbfd018 ---

The locks this process holds are:
exclusive sleep mutex vnode interlock (vnode interlock) r = 0 (0x8972a74) 
locked @ kern/vfs_subr.c:2513
shared lockmgr ufs (ufs) r = 0 (0xc8bd181c) locked @ kern/vfs_subr.c:2161

The only other lock held by any thread/process is:
Process 12 (intr) thread 0xc5dfc5e0 (100012)
exclusive sleep mutex Giant (Giant) r = 1 (0xc127b690) locked @ 
dev/syscons/syscons.c:724

The only 2 locked vnodes are the ufs one and the syncer one, as
shown above.

The rest of the processes/threads 
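
For anyone matching the syncer trace above against the code, here is a sketch
of the __mnt_vnode_next_active() retry path as it was before the fix,
reconstructed from the '-' lines of the patch in the 2012-12-13 message above
(an approximation, not the exact source):

	while (vp != NULL) {
		if (vp->v_type == VMARKER) {
			vp = TAILQ_NEXT(vp, v_actfreelist);
			continue;
		}
		if (!VI_TRYLOCK(vp)) {
			if (should_yield()) {
				mtx_unlock(&vnode_free_list_mtx);
				kern_yield(PRI_UNCHANGED);
				mtx_lock(&vnode_free_list_mtx);
			}
			/* Retry the same vnode whether or not we yielded. */
			goto restart;
		}
		...
	}

When VI_TRYLOCK() fails and should_yield() is false, the loop goes straight
back to restart; even when it does yield, kern_yield(PRI_UNCHANGED) keeps the
caller's priority, so on a single-CPU machine the thread owning the vnode
interlock may never get to run. That is the livelock the ALWAYS_YIELD /
PRI_USER change in the patch above is aimed at.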

Re: r244036 kernel hangs under load.

2012-12-12 Thread Konstantin Belousov
On Wed, Dec 12, 2012 at 10:01:39PM -0500, Rick Macklem wrote:
 Konstantin Belousov wrote:
  On Tue, Dec 11, 2012 at 08:58:47PM -0500, Rick Macklem wrote:
   Ok, I'll test r243598 and then r243599 and r243835, just to
   see if it really is this.
  
   I'll email when I have done this.
  If you test only r243598, I am sure that you would experience
  corruption.
  The r243599 should cause the deadlocks.
  
 I think you meant r243599 will result in corruptions and
 r243835 deadlocks.
 
 I have run r243598 for a while without a hang. (r243599 doesn't
 build) I haven't tried r243835 yet.
 
  

  Also, do you use the post-r244095 kernel ?

 Before and after. The most recent tests were post-r244095.
 (If anything the more recent kernels hang more easily.)


 
  Is your machine SMP ?

 Old, slow single core i386.
   
Try this. Please note that this is mostly a debugging facility.
   
   It seemed to help, but didn't stop the hangs completely.
   r244125 without the patch would hang somewhere in a kernel
   build. r244125 plus this patch ran almost 2 kernel builds
   before it got hung.
  
  Can you try to look into the system state on the hang (on the kernel
  with the if (1 || patch applied) ? Using the ddb and recipe from the
  web page. Basically watch for a thread looping in the mnt_active
  iterator and threads owning vnode interlocks.
 
 Ok, there is only one process in the mnt_active iterator and its
 trace is as follows (syncer):
 dle+0x12d/frame 0xdfe33adc (I suspect the screen lost an 'I')
 intr_execute_handlers(c5e3d064,dfe33b20,0,dfe33b64,c0ec2115,...) at 
 intr_execute_handlers+0x49/frame 0xdfe33afc
 lapic_handle_intr(3c,dfe33b20) at lapic_handle_intr+0x36/frame 0xdfe33b10
 Xapic_isr1() at Xapic_isr1+0x35/frame 0xdfe33b10
 --- interrupt, eip = 0xc0eca8db, esp = 0xdfe33b60, ebp = 0xdfe33b64 ---
 spinlock_exit(c128be90,4,c10b5017,130,1710,...) at spinlock_exit+0x2b/frame 
 0xdfe33b64
 __mtx_unlock_spin_flags(c128be90,0,c10b80be,25d,0,...) at 
 __mtx_unlock_spin_flags+0x112/frame 0xdfe33b90
 kern_yield(,0,c10c75c9,127b,c8b05238,...) at kern_yield+0x125/frame 
 0xdfe33bbc
 __mnt_vnode_next_active(dfe33c08,c857ba80,c10c75c9,dac,5d7,...) at 
 __mnt_vnode_next_active+0xda/frame 0xdfe33be0
 vfs_msync(c857ba80,2,2,e6b,c857ba80,...) at vfs_msync+0x175/frame 0xdfe33c18
 sync_fsync(dfe33ca8,c85cf470,80400,c10c75c9,6f4,...) at 
 sync_fsync+0x141/frame 0xdfe33c64
 VOP_FSYNC_APV(c12008a0,dfe33ca8,c10c75c9,6f4,4e20,...) at 
 VOP_FSYNC_APV+0xb4/frame 0xdfe33c64
 sched_sync(0,dfe33d08,c10b0e10,3db,c85395a0,...) at sched_sync+0x399/frame 
 0xdfe33cc8
 fork_exit(c0b79420,0,dfe33d08) at fork_exit+0xc0/frame 0xdfe33cf4
 fork_trampoline() at fork_trampoline+0x8/frame 0xdfe33cf4
 --- trap 0, eip = 0, esp = 0xdfe33d40, ebp = 0 ---
 
 This process holds:
 exclusive lockmgr syncer (syncer) r = 0 (0xc85cf4c8) locked @ 
 kern/vfs_subr.c:1780
 
 The only other process that is doing anything in the VFS subsystem
 holds the vnode interlock. Its trace is:
 dle+0x12d/frame 0xdfe6a850
 intr_execute_handlers(c5f721c0,dfe6a894,0,dfe6a908,c0ec2115,...) at 
 intr_execute_handlers+0x49/frame 0xdfe6a870
 lapic_handle_intr(31,dfe6a894) at lapic_handle_intr+0x36/frame 0xdfe6a884
 Xapic_isr1() at Xapic_isr1+0x35/frame 0xdfe6a884
 --- interrupt, eip = 0xc0b2206a, esp = 0xdfe6a8d4, ebp = 0xdfe6a908 ---
 witness_unlock(c8972a74,8,c10c75c9,965,0,...) at witness_unlock+0x3a/frame 
 0xdfe6a908
 __mtx_unlock_flags(c8972a84,0,c10c75c9,965,c89729fc,...) at 
 __mtx_unlock_flags+0x9f/frame 0xdfe6a938
 vdropl(c89729fc,dfe6a974,c10c75c9,8e7,c1238020,...) at vdropl+0x63/frame 
 0xdfe6a95c
 vputx(dfe6aa04,c0b67acc,c89729fc,dfe6a9e4,dfe6abc4,...) at vputx+0x300/frame 
 0xdfe6a994
 vput(c89729fc,dfe6a9e4,dfe6abc4,31d,dfe6a9e4,...) at vput+0x10/frame 
 0xdfe6a99c
 lookup(dfe6ab84,c857e000,0,ce,c13c83c8,...) at lookup+0x9bc/frame 0xdfe6aa04
 namei(dfe6ab84,0,0,246,0,...) at namei+0x4fe/frame 0xdfe6aa80
 vn_open_cred(dfe6ab84,dfe6ac24,1a4,0,c5dd4580,...) at 
 vn_open_cred+0x2c0/frame 0xdfe6ab40
 vn_open(dfe6ab84,dfe6ac24,1a4,c85922a0,c853a2d0,...) at vn_open+0x3b/frame 
 0xdfe6ab60
 kern_openat(c85c55e0,ff9c,2882dcc0,0,8001,...) at kern_openat+0x1e2/frame 
 0xdfe6ac0c
 kern_open(c85c55e0,2882dcc0,0,8000,1b6,...) at kern_open+0x35/frame 0xdfe6ac2c
 sys_open(c85c55e0,dfe6accc,c02acde7,7307f55d,5e5b00,...) at 
 sys_open+0x30/frame 0xdfe6ac48
 syscall(dfe6ad08) at syscall+0x2e5/frame 0xdfe6acfc
 Xint0x80_syscall() at Xint0x80_syscall+0x21/frame 0xdfe6acfc
 --- syscall (5, FreeBSD ELF32, sys_open), eip = 0x84a1667, esp = 0xbfbfcffc, 
 ebp = 0xbfbfd018 ---
 
 The locks this process holds are:
 exclusive sleep mutex vnode interlock (vnode interlock) r = 0 (0x8972a74) 
 locked @ kern/vfs_subr.c:2513
 shared lockmgr ufs (ufs) r = 0 (0xc8bd181c) locked @ kern/vfs_subr.c:2161
 
 The only other lock held by any thread/process is:
 Process 12 (intr) thread 0xc5dfc5e0 (100012)
 exclusive sleep mutex 

Re: r244036 kernel hangs under load.

2012-12-11 Thread Rick Macklem
Konstantin Belousov wrote:
 On Mon, Dec 10, 2012 at 07:11:59PM -0500, Rick Macklem wrote:
  Konstantin Belousov wrote:
   On Mon, Dec 10, 2012 at 01:38:21PM -0500, Rick Macklem wrote:
Adrian Chadd wrote:
 .. what was the previous kernel version?

Hopefully Tim has it narrowed down more, but I don't see
the hangs on a Sept. 7 kernel from head and I do see them
on a Dec. 3 kernel from head. (Don't know the eact rNN.)
   
It seems to predate my commit (r244008), which was my first
concern.
   
I use old single core i386 hardware and can fairly reliably
reproduce it by doing a kernel build and a svn checkout
concurrently. No NFS activity. These are running on a local
disk (UFS/FFS). (The kernel I reproduce it on is built via
GENERIC for i386. If you want me to start a binary search
for which rNN, I can do that, but it will take a while.:-)
   
I can get out into DDB, but I'll admit I don't know enough
about it to know where to look;-)
Here's some lines from db ps, in case they give someone
useful information. (I can leave this box sitting in DB for
the rest of to-day, in case someone can suggest what I should
look for on it.)
   
Just snippets...
   Ss pause adjkerntz
   DL sdflush [sofdepflush]
   RL [syncer]
   DL vlruwt [vnlru]
   DL psleep [bufdaemon]
   RL [pagezero]
   DL psleep [vmdaemon]
   DL psleep [pagedaemon]
   DL ccb_scan [xpt_thrd]
   DL waiting_ [sctp_iterator]
   DL ctl_work [ctl_thrd]
   DL cooling [acpi_cooling0]
   DL tzpoll [acpi_thermal]
   DL (threaded) [usb]
   ...
   DL - [yarrow]
   DL (threaded) [geom]
   D - [g_down]
   D - [g_up]
   D - [g_event]
   RL (threaded) [intr]
   I [irq15: ata1]
   ...
   Run CPU0 [swi6: Giant taskq]
-- does this one indicate the CPU is actually running this?
   (after a db cont, wait a while ctrlaltesc db ps
it is still the same)
   I [swi4: clock]
   I [swi1: netisr 0]
   I [swi3: vm]
   RL [idle: cpu0]
   SLs wait [init]
   DL audit_wo [audit]
   DLs (threaded) [kernel]
   D - [deadlkres]
   ...
   D sched [swapper]
   
I have no idea if this ps output helps, unless it indicates
that it is looping on the Giant taskq?
   Might be. You could do 'bt pid' for the process to see where it
   loops.
   Another good set of hints is at
   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
  Kostik, you must be clairvoyant;-)
 
  When I did show alllocks, I found that the syncer process held
  - exclusive sleep mutex mount mtx locked @ kern/vfs_subr.c:4720
  - exclusive lockmgr syncer locked @ kern/vfs_subr.c:1780
  The trace for this process goes like:
   spinlock_exit
   mtx_unlock_spin_flags
   kern_yield
   _mnt_vnode_next_active
   vnode_next_active
   vfs_msync()
 
  So, it seems like your r244095 commit might have fixed this?
  (I'm not good at this stuff, but from your description, it looks
   like it did the kern_yield() with the mutex held and maybe
   got into trouble trying to acquire Giant?)
 
  Anyhow, I'm going to test a kernel with r244095 in it and see
  if I can still reproduce the hang.
  (There wasn't much else in the show alllocks, except a
   process that held the exclusive vnode interlock mutex plus
   a ufs vnode lock, but it's just doing a witness_unlock.)
 There must be a thread blocked for the mount interlock for the loop
 in the mnt_vnode_next_active to cause livelock.
 
Yes. I am getting hangs with the -current kernel and they seem
easier for me to reproduce.

For the one I just did, the syncer seems to be blocked at
 VI_TRYLOCK() in _mnt_vnode_next_active().
The vnode interlock mutex is exclusively locked by a sh
process (11627). Now, here is where it gets weird...
When I do a db trace 11627 I get the following:
witness_unlock+0x1f3  (subr_witness.c:1563)
mtx_unlock_flags+0x9f (kern_mutex.c:250)
vdropl+0x63   (vfs_subr.c:2405)
vputx+0x130   (vfs_subr.c:2116)
vput+0x10 (vfs_subr.c:2319)
vm_mmap+0x52e (vm_mmap.c:1341)
sys_mmap

So, it seems this process is stuck while trying to unlock
the mutex, if that makes any sense...

When I look at subr_witness.c around 1563, I can't see
why anything would block there? (here's the code snippet)
if ((instance->li_flags & LI_EXCLUSIVE) == 0 && witness_watch > 0 &&
    (flags & LOP_EXCLUSIVE) != 0) {
	printf("exclusive unlock of (%s) %s @ %s:%d\n", class->lc_name,
	    lock->lo_name, fixup_filename(file), line);
	printf("while share locked from %s:%d\n",
	    fixup_filename(instance->li_file),
	    instance->li_line);
	panic("share->uexcl");
}
/* If we are recursed, unrecurse. */
*1563*	if ((instance->li_flags & LI_RECURSEMASK) > 0) {
	CTR4(KTR_WITNESS, 

Re: r244036 kernel hangs under load.

2012-12-11 Thread Attilio Rao
On Tue, Dec 11, 2012 at 9:55 PM, Rick Macklem rmack...@uoguelph.ca wrote:
 Konstantin Belousov wrote:
 On Mon, Dec 10, 2012 at 07:11:59PM -0500, Rick Macklem wrote:
  Konstantin Belousov wrote:
   On Mon, Dec 10, 2012 at 01:38:21PM -0500, Rick Macklem wrote:
Adrian Chadd wrote:
 .. what was the previous kernel version?

Hopefully Tim has it narrowed down more, but I don't see
the hangs on a Sept. 7 kernel from head and I do see them
on a Dec. 3 kernel from head. (Don't know the eact rNN.)
   
It seems to predate my commit (r244008), which was my first
concern.
   
I use old single core i386 hardware and can fairly reliably
reproduce it by doing a kernel build and a svn checkout
concurrently. No NFS activity. These are running on a local
disk (UFS/FFS). (The kernel I reproduce it on is built via
GENERIC for i386. If you want me to start a binary search
for which rNN, I can do that, but it will take a while.:-)
   
I can get out into DDB, but I'll admit I don't know enough
about it to know where to look;-)
Here's some lines from db ps, in case they give someone
useful information. (I can leave this box sitting in DB for
the rest of to-day, in case someone can suggest what I should
look for on it.)
   
Just snippets...
   Ss pause adjkerntz
   DL sdflush [sofdepflush]
   RL [syncer]
   DL vlruwt [vnlru]
   DL psleep [bufdaemon]
   RL [pagezero]
   DL psleep [vmdaemon]
   DL psleep [pagedaemon]
   DL ccb_scan [xpt_thrd]
   DL waiting_ [sctp_iterator]
   DL ctl_work [ctl_thrd]
   DL cooling [acpi_cooling0]
   DL tzpoll [acpi_thermal]
   DL (threaded) [usb]
   ...
   DL - [yarrow]
   DL (threaded) [geom]
   D - [g_down]
   D - [g_up]
   D - [g_event]
   RL (threaded) [intr]
   I [irq15: ata1]
   ...
   Run CPU0 [swi6: Giant taskq]
-- does this one indicate the CPU is actually running this?
   (after a db cont, wait a while ctrlaltesc db ps
it is still the same)
   I [swi4: clock]
   I [swi1: netisr 0]
   I [swi3: vm]
   RL [idle: cpu0]
   SLs wait [init]
   DL audit_wo [audit]
   DLs (threaded) [kernel]
   D - [deadlkres]
   ...
   D sched [swapper]
   
I have no idea if this ps output helps, unless it indicates
that it is looping on the Giant taskq?
   Might be. You could do 'bt pid' for the process to see where it
   loops.
   Another good set of hints is at
   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
  Kostik, you must be clairvoyant;-)
 
  When I did show alllocks, I found that the syncer process held
  - exclusive sleep mutex mount mtx locked @ kern/vfs_subr.c:4720
  - exclusive lockmgr syncer locked @ kern/vfs_subr.c:1780
  The trace for this process goes like:
   spinlock_exit
   mtx_unlock_spin_flags
   kern_yield
   _mnt_vnode_next_active
   vnode_next_active
   vfs_msync()
 
  So, it seems like your r244095 commit might have fixed this?
  (I'm not good at this stuff, but from your description, it looks
   like it did the kern_yield() with the mutex held and maybe
   got into trouble trying to acquire Giant?)
 
  Anyhow, I'm going to test a kernel with r244095 in it and see
  if I can still reproduce the hang.
  (There wasn't much else in the show alllocks, except a
   process that held the exclusive vnode interlock mutex plus
   a ufs vnode lock, but it's just doing a witness_unlock.)
 There must be a thread blocked for the mount interlock for the loop
 in the mnt_vnode_next_active to cause livelock.

 Yes. I am getting hangs with the -current kernel and they seem
 easier for me to reproduce.

Can you report the svn rev number the kernel is built from?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: r244036 kernel hangs under load.

2012-12-11 Thread Konstantin Belousov
On Tue, Dec 11, 2012 at 04:55:52PM -0500, Rick Macklem wrote:
 Konstantin Belousov wrote:
  On Mon, Dec 10, 2012 at 07:11:59PM -0500, Rick Macklem wrote:
   Konstantin Belousov wrote:
On Mon, Dec 10, 2012 at 01:38:21PM -0500, Rick Macklem wrote:
 Adrian Chadd wrote:
  .. what was the previous kernel version?
 
 Hopefully Tim has it narrowed down more, but I don't see
 the hangs on a Sept. 7 kernel from head and I do see them
 on a Dec. 3 kernel from head. (Don't know the eact rNN.)

 It seems to predate my commit (r244008), which was my first
 concern.

 I use old single core i386 hardware and can fairly reliably
 reproduce it by doing a kernel build and a svn checkout
 concurrently. No NFS activity. These are running on a local
 disk (UFS/FFS). (The kernel I reproduce it on is built via
 GENERIC for i386. If you want me to start a binary search
 for which rNN, I can do that, but it will take a while.:-)

 I can get out into DDB, but I'll admit I don't know enough
 about it to know where to look;-)
 Here's some lines from db ps, in case they give someone
 useful information. (I can leave this box sitting in DB for
 the rest of to-day, in case someone can suggest what I should
 look for on it.)

 Just snippets...
Ss pause adjkerntz
DL sdflush [sofdepflush]
RL [syncer]
DL vlruwt [vnlru]
DL psleep [bufdaemon]
RL [pagezero]
DL psleep [vmdaemon]
DL psleep [pagedaemon]
DL ccb_scan [xpt_thrd]
DL waiting_ [sctp_iterator]
DL ctl_work [ctl_thrd]
DL cooling [acpi_cooling0]
DL tzpoll [acpi_thermal]
DL (threaded) [usb]
...
DL - [yarrow]
DL (threaded) [geom]
D - [g_down]
D - [g_up]
D - [g_event]
RL (threaded) [intr]
I [irq15: ata1]
...
Run CPU0 [swi6: Giant taskq]
 -- does this one indicate the CPU is actually running this?
(after a db cont, wait a while ctrlaltesc db ps
 it is still the same)
I [swi4: clock]
I [swi1: netisr 0]
I [swi3: vm]
RL [idle: cpu0]
SLs wait [init]
DL audit_wo [audit]
DLs (threaded) [kernel]
D - [deadlkres]
...
D sched [swapper]

 I have no idea if this ps output helps, unless it indicates
 that it is looping on the Giant taskq?
Might be. You could do 'bt pid' for the process to see where it
loops.
Another good set of hints is at
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
  
   Kostik, you must be clairvoyant;-)
  
   When I did show alllocks, I found that the syncer process held
   - exclusive sleep mutex mount mtx locked @ kern/vfs_subr.c:4720
   - exclusive lockmgr syncer locked @ kern/vfs_subr.c:1780
   The trace for this process goes like:
spinlock_exit
mtx_unlock_spin_flags
kern_yield
_mnt_vnode_next_active
vnode_next_active
vfs_msync()
  
   So, it seems like your r244095 commit might have fixed this?
   (I'm not good at this stuff, but from your description, it looks
like it did the kern_yield() with the mutex held and maybe
got into trouble trying to acquire Giant?)
  
   Anyhow, I'm going to test a kernel with r244095 in it and see
   if I can still reproduce the hang.
   (There wasn't much else in the show alllocks, except a
process that held the exclusive vnode interlock mutex plus
a ufs vnode lock, but it's just doing a witness_unlock.)
  There must be a thread blocked for the mount interlock for the loop
  in the mnt_vnode_next_active to cause livelock.
  
 Yes. I am getting hangs with the -current kernel and they seem
 easier for me to reproduce.
 
 For the one I just did, the syncer seems to be blocked at
  VI_TRYLOCK() in _mnt_vnode_next_active().
trylock cannot block.

  The vnode interlock mutex is exclusively locked by a sh
 process (11627). Now, here is where it gets weird...
 When I do a db trace 11627 I get the following:
 witness_unlock+0x1f3  (subr_witness.c:1563)
 mtx_unlock_flags+0x9f (kern_mutex.c:250)
 vdropl+0x63   (vfs_subr.c:2405)
 vputx+0x130   (vfs_subr.c:2116)
 vput+0x10 (vfs_subr.c:2319)
 vm_mmap+0x52e (vm_mmap.c:1341)
 sys_mmap
 
 So, it seems this process is stuck while trying to unlock
 the mutex, if that makes any sense...
It is probably not stuck, but you just caught it at this moment.

The issue sounds more like a livelock. Can you obtain _all_ the information
listed in the deadlock debugging page I sent earlier, and provide it to
me ? Also, do you use the post-r244095 kernel ?

Is your machine SMP ?




Re: r244036 kernel hangs under load.

2012-12-11 Thread Rick Macklem
Konstantin Belousov wrote:
 On Tue, Dec 11, 2012 at 04:55:52PM -0500, Rick Macklem wrote:
  Konstantin Belousov wrote:
   On Mon, Dec 10, 2012 at 07:11:59PM -0500, Rick Macklem wrote:
Konstantin Belousov wrote:
 On Mon, Dec 10, 2012 at 01:38:21PM -0500, Rick Macklem wrote:
  Adrian Chadd wrote:
   .. what was the previous kernel version?
  
  Hopefully Tim has it narrowed down more, but I don't see
  the hangs on a Sept. 7 kernel from head and I do see them
  on a Dec. 3 kernel from head. (Don't know the eact rNN.)
 
  It seems to predate my commit (r244008), which was my first
  concern.
 
  I use old single core i386 hardware and can fairly reliably
  reproduce it by doing a kernel build and a svn checkout
  concurrently. No NFS activity. These are running on a local
  disk (UFS/FFS). (The kernel I reproduce it on is built via
  GENERIC for i386. If you want me to start a binary search
  for which rNN, I can do that, but it will take a
  while.:-)
 
  I can get out into DDB, but I'll admit I don't know enough
  about it to know where to look;-)
  Here's some lines from db ps, in case they give someone
  useful information. (I can leave this box sitting in DB for
  the rest of to-day, in case someone can suggest what I
  should
  look for on it.)
 
  Just snippets...
 Ss pause adjkerntz
 DL sdflush [sofdepflush]
 RL [syncer]
 DL vlruwt [vnlru]
 DL psleep [bufdaemon]
 RL [pagezero]
 DL psleep [vmdaemon]
 DL psleep [pagedaemon]
 DL ccb_scan [xpt_thrd]
 DL waiting_ [sctp_iterator]
 DL ctl_work [ctl_thrd]
 DL cooling [acpi_cooling0]
 DL tzpoll [acpi_thermal]
 DL (threaded) [usb]
 ...
 DL - [yarrow]
 DL (threaded) [geom]
 D - [g_down]
 D - [g_up]
 D - [g_event]
 RL (threaded) [intr]
 I [irq15: ata1]
 ...
 Run CPU0 [swi6: Giant taskq]
  -- does this one indicate the CPU is actually running this?
 (after a db cont, wait a while ctrlaltesc db ps
  it is still the same)
 I [swi4: clock]
 I [swi1: netisr 0]
 I [swi3: vm]
 RL [idle: cpu0]
 SLs wait [init]
 DL audit_wo [audit]
 DLs (threaded) [kernel]
 D - [deadlkres]
 ...
 D sched [swapper]
 
  I have no idea if this ps output helps, unless it
  indicates
  that it is looping on the Giant taskq?
 Might be. You could do 'bt pid' for the process to see where
 it
 loops.
 Another good set of hints is at
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
   
Kostik, you must be clairvoyant;-)
   
When I did show alllocks, I found that the syncer process held
- exclusive sleep mutex mount mtx locked @ kern/vfs_subr.c:4720
- exclusive lockmgr syncer locked @ kern/vfs_subr.c:1780
The trace for this process goes like:
 spinlock_exit
 mtx_unlock_spin_flags
 kern_yield
 _mnt_vnode_next_active
 vnode_next_active
 vfs_msync()
   
So, it seems like your r244095 commit might have fixed this?
(I'm not good at this stuff, but from your description, it looks
 like it did the kern_yield() with the mutex held and maybe
 got into trouble trying to acquire Giant?)
   
Anyhow, I'm going to test a kernel with r244095 in it and see
if I can still reproduce the hang.
(There wasn't much else in the show alllocks, except a
 process that held the exclusive vnode interlock mutex plus
 a ufs vnode lock, but it's just doing a witness_unlock.)
   There must be a thread blocked for the mount interlock for the
   loop
   in the mnt_vnode_next_active to cause livelock.
  
  Yes. I am getting hangs with the -current kernel and they seem
  easier for me to reproduce.
 
  For the one I just did, the syncer seems to be blocked at
   VI_TRYLOCK() in _mnt_vnode_next_active().
 trylock cannot block.
 
  The vnode interlock mutex is exclusively locked by a sh
  process (11627). Now, here is where it gets weird...
  When I do a db trace 11627 I get the following:
  witness_unlock+0x1f3 (subr_witness.c:1563)
  mtx_unlock_flags+0x9f (kern_mutex.c:250)
  vdropl+0x63 (vfs_subr.c:2405)
  vputx+0x130 (vfs_subr.c:2116)
  vput+0x10 (vfs_subr.c:2319)
  vm_mmap+0x52e (vm_mmap.c:1341)
  sys_mmap
 
  So, it seems this process is stuck while trying to unlock
  the mutex, if that makes any sense...
 It is probably not stuck, but you just caught it at this moment.
 
 The issue sounds more like a livelock. Can you obtain _all_ the
 information
 listed in the deadlock debugging page I sent earlier, and provide it
 to
 me ?
Well, this is a laptop and when it hangs (doesn't do anything, except
sometimes echo characters on the console screen) I hit ctrl-alt-esc
to get 

Re: r244036 kernel hangs under load.

2012-12-11 Thread Konstantin Belousov
On Tue, Dec 11, 2012 at 05:30:24PM -0500, Rick Macklem wrote:
 Konstantin Belousov wrote:
  On Tue, Dec 11, 2012 at 04:55:52PM -0500, Rick Macklem wrote:
   Konstantin Belousov wrote:
On Mon, Dec 10, 2012 at 07:11:59PM -0500, Rick Macklem wrote:
 Konstantin Belousov wrote:
  On Mon, Dec 10, 2012 at 01:38:21PM -0500, Rick Macklem wrote:
   Adrian Chadd wrote:
.. what was the previous kernel version?
   
   Hopefully Tim has it narrowed down more, but I don't see
   the hangs on a Sept. 7 kernel from head and I do see them
   on a Dec. 3 kernel from head. (Don't know the eact rNN.)
  
   It seems to predate my commit (r244008), which was my first
   concern.
  
   I use old single core i386 hardware and can fairly reliably
   reproduce it by doing a kernel build and a svn checkout
   concurrently. No NFS activity. These are running on a local
   disk (UFS/FFS). (The kernel I reproduce it on is built via
   GENERIC for i386. If you want me to start a binary search
   for which rNN, I can do that, but it will take a
   while.:-)
  
   I can get out into DDB, but I'll admit I don't know enough
   about it to know where to look;-)
   Here's some lines from db ps, in case they give someone
   useful information. (I can leave this box sitting in DB for
   the rest of to-day, in case someone can suggest what I
   should
   look for on it.)
  
   Just snippets...
  Ss pause adjkerntz
  DL sdflush [sofdepflush]
  RL [syncer]
  DL vlruwt [vnlru]
  DL psleep [bufdaemon]
  RL [pagezero]
  DL psleep [vmdaemon]
  DL psleep [pagedaemon]
  DL ccb_scan [xpt_thrd]
  DL waiting_ [sctp_iterator]
  DL ctl_work [ctl_thrd]
  DL cooling [acpi_cooling0]
  DL tzpoll [acpi_thermal]
  DL (threaded) [usb]
  ...
  DL - [yarrow]
  DL (threaded) [geom]
  D - [g_down]
  D - [g_up]
  D - [g_event]
  RL (threaded) [intr]
  I [irq15: ata1]
  ...
  Run CPU0 [swi6: Giant taskq]
   -- does this one indicate the CPU is actually running this?
  (after a db cont, wait a while ctrlaltesc db ps
   it is still the same)
  I [swi4: clock]
  I [swi1: netisr 0]
  I [swi3: vm]
  RL [idle: cpu0]
  SLs wait [init]
  DL audit_wo [audit]
  DLs (threaded) [kernel]
  D - [deadlkres]
  ...
  D sched [swapper]
  
   I have no idea if this ps output helps, unless it
   indicates
   that it is looping on the Giant taskq?
  Might be. You could do 'bt pid' for the process to see where
  it
  loops.
  Another good set of hints is at
  http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html

 Kostik, you must be clairvoyant;-)

 When I did show alllocks, I found that the syncer process held
 - exclusive sleep mutex mount mtx locked @ kern/vfs_subr.c:4720
 - exclusive lockmgr syncer locked @ kern/vfs_subr.c:1780
 The trace for this process goes like:
  spinlock_exit
  mtx_unlock_spin_flags
  kern_yield
  _mnt_vnode_next_active
  vnode_next_active
  vfs_msync()

 So, it seems like your r244095 commit might have fixed this?
 (I'm not good at this stuff, but from your description, it looks
  like it did the kern_yield() with the mutex held and maybe
  got into trouble trying to acquire Giant?)

 Anyhow, I'm going to test a kernel with r244095 in it and see
 if I can still reproduce the hang.
 (There wasn't much else in the show alllocks, except a
  process that held the exclusive vnode interlock mutex plus
  a ufs vnode lock, but it's just doing a witness_unlock.)
There must be a thread blocked for the mount interlock for the
loop
in the mnt_vnode_next_active to cause livelock.
   
   Yes. I am getting hangs with the -current kernel and they seem
   easier for me to reproduce.
  
   For the one I just did, the syncer seems to be blocked at
VI_TRYLOCK() in _mnt_vnode_next_active().
  trylock cannot block.
  
   The vnode interlock mutex is exclusively locked by a sh
   process (11627). Now, here is where it gets weird...
   When I do a db trace 11627 I get the following:
   witness_unlock+0x1f3 (subr_witness.c:1563)
   mtx_unlock_flags+0x9f (kern_mutex.c:250)
   vdropl+0x63 (vfs_subr.c:2405)
   vputx+0x130 (vfs_subr.c:2116)
   vput+0x10 (vfs_subr.c:2319)
   vm_mmap+0x52e (vm_mmap.c:1341)
   sys_mmap
  
   So, it seems this process is stuck while trying to unlock
   the mutex, if that makes any sense...
  It is probably not stuck, but you just caught it at this moment.
  
  The issue sounds more like a livelock. Can you obtain _all_ the
  information
  listed in the deadlock 

Re: r244036 kernel hangs under load.

2012-12-11 Thread Rick Macklem
Konstantin Belousov wrote:
 On Tue, Dec 11, 2012 at 05:30:24PM -0500, Rick Macklem wrote:
  Konstantin Belousov wrote:
   On Tue, Dec 11, 2012 at 04:55:52PM -0500, Rick Macklem wrote:
Konstantin Belousov wrote:
 On Mon, Dec 10, 2012 at 07:11:59PM -0500, Rick Macklem wrote:
  Konstantin Belousov wrote:
   On Mon, Dec 10, 2012 at 01:38:21PM -0500, Rick Macklem
   wrote:
Adrian Chadd wrote:
 .. what was the previous kernel version?

Hopefully Tim has it narrowed down more, but I don't see
the hangs on a Sept. 7 kernel from head and I do see
them
on a Dec. 3 kernel from head. (Don't know the eact
rNN.)
   
It seems to predate my commit (r244008), which was my
first
concern.
   
I use old single core i386 hardware and can fairly
reliably
reproduce it by doing a kernel build and a svn
checkout
concurrently. No NFS activity. These are running on a
local
disk (UFS/FFS). (The kernel I reproduce it on is built
via
GENERIC for i386. If you want me to start a binary
search
for which rNN, I can do that, but it will take a
while.:-)
   
I can get out into DDB, but I'll admit I don't know
enough
about it to know where to look;-)
Here's some lines from db ps, in case they give
someone
useful information. (I can leave this box sitting in DB
for
the rest of to-day, in case someone can suggest what I
should
look for on it.)
   
Just snippets...
   Ss pause adjkerntz
   DL sdflush [sofdepflush]
   RL [syncer]
   DL vlruwt [vnlru]
   DL psleep [bufdaemon]
   RL [pagezero]
   DL psleep [vmdaemon]
   DL psleep [pagedaemon]
   DL ccb_scan [xpt_thrd]
   DL waiting_ [sctp_iterator]
   DL ctl_work [ctl_thrd]
   DL cooling [acpi_cooling0]
   DL tzpoll [acpi_thermal]
   DL (threaded) [usb]
   ...
   DL - [yarrow]
   DL (threaded) [geom]
   D - [g_down]
   D - [g_up]
   D - [g_event]
   RL (threaded) [intr]
   I [irq15: ata1]
   ...
   Run CPU0 [swi6: Giant taskq]
-- does this one indicate the CPU is actually running
this?
   (after a db cont, wait a while ctrlaltesc db
   ps
it is still the same)
   I [swi4: clock]
   I [swi1: netisr 0]
   I [swi3: vm]
   RL [idle: cpu0]
   SLs wait [init]
   DL audit_wo [audit]
   DLs (threaded) [kernel]
   D - [deadlkres]
   ...
   D sched [swapper]
   
I have no idea if this ps output helps, unless it
indicates
that it is looping on the Giant taskq?
   Might be. You could do 'bt pid' for the process to see
   where
   it
   loops.
   Another good set of hints is at
   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
  Kostik, you must be clairvoyant;-)
 
  When I did show alllocks, I found that the syncer process
  held
  - exclusive sleep mutex mount mtx locked @
  kern/vfs_subr.c:4720
  - exclusive lockmgr syncer locked @ kern/vfs_subr.c:1780
  The trace for this process goes like:
   spinlock_exit
   mtx_unlock_spin_flags
   kern_yield
   _mnt_vnode_next_active
   vnode_next_active
   vfs_msync()
 
  So, it seems like your r244095 commit might have fixed this?
  (I'm not good at this stuff, but from your description, it
  looks
   like it did the kern_yield() with the mutex held and
   maybe
   got into trouble trying to acquire Giant?)
 
  Anyhow, I'm going to test a kernel with r244095 in it and
  see
  if I can still reproduce the hang.
  (There wasn't much else in the show alllocks, except a
   process that held the exclusive vnode interlock mutex plus
   a ufs vnode lock, but it's just doing a witness_unlock.)
 There must be a thread blocked for the mount interlock for the
 loop
 in the mnt_vnode_next_active to cause livelock.

Yes. I am getting hangs with the -current kernel and they seem
easier for me to reproduce.
   
For the one I just did, the syncer seems to be blocked at
 VI_TRYLOCK() in _mnt_vnode_next_active().
   trylock cannot block.
  
The vnode interlock mutex is exclusively locked by a sh
process (11627). Now, here is where it gets weird...
When I do a db trace 11627 I get the following:
witness_unlock+0x1f3 (subr_witness.c:1563)
mtx_unlock_flags+0x9f (kern_mutex.c:250)
vdropl+0x63 (vfs_subr.c:2405)
vputx+0x130 (vfs_subr.c:2116)
vput+0x10 (vfs_subr.c:2319)
vm_mmap+0x52e 

Re: r244036 kernel hangs under load.

2012-12-11 Thread Konstantin Belousov
On Tue, Dec 11, 2012 at 08:58:47PM -0500, Rick Macklem wrote:
 Ok, I'll test r243598 and then r243599 and r243835, just to
 see if it really is this.
 
 I'll email when I have done this.
If you test only r243598, I am sure that you would experience corruption.
The r243599 should cause the deadlocks.

 
  
Also, do you use the post-r244095 kernel ?
  
   Before and after. The most recent tests were post-r244095.
   (If anything the more recent kernels hang more easily.)
  
  
   
Is your machine SMP ?
  
   Old, slow single core i386.
  
  Try this. Please note that this is mostly a debugging facility.
  
 It seemed to help, but didn't stop the hangs completely.
 r244125 without the patch would hang somewhere in a kernel
 build. r244125 plus this patch ran almost 2 kernel builds
 before it got hung.

Can you try to look into the system state on the hang (on the kernel
with the if (1 || patch applied) ? Using the ddb and recipe from the
web page. Basically watch for a thread looping in the mnt_active
iterator and threads owning vnode interlocks.
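
(For reference, the recipe on that page, applied to this hang, boils down to
roughly the following DDB commands; a sketch only, what matters is comparing
the trace of the thread looping in the iterator with the traces of whatever
threads show up as owning vnode interlocks:)

db> ps
db> show alllocks
db> show lockedvnods
db> trace <pid of the thread looping in the iterator>
db> trace <pid of each thread owning a vnode interlock>
db> alltrace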




Re: r244036 kernel hangs under load.

2012-12-10 Thread Adrian Chadd
.. what was the previous kernel version?



adrian


On 9 December 2012 22:08, Tim Kientzle kient...@freebsd.org wrote:
 I haven't found any useful clues yet, but thought I'd ask if anyone else
 was seeing hangs in a recent kernel.

 I just upgraded to r244036 using a straight GENERIC i386 kernel.
 (Straight buildworld/buildkernel, no local changes, /etc/src.conf doesn't
 exist, /etc/make.conf just has PERL_VERSION defined.)

 When I try to cross build an ARM world on the resulting system,
 the entire system hangs hard after about 30 minutes:  No network,
 no keyboard response, no nothing.

 Don't know if it's relevant, but the system is using NFS pretty
 heavily (Parallels VM mounting NFS from Mac OS 10.7 host.)

 I'll try to get some more details ...

 Tim



Re: r244036 kernel hangs under load.

2012-12-10 Thread Rick Macklem
Adrian Chadd wrote:
 .. what was the previous kernel version?
 
Hopefully Tim has it narrowed down more, but I don't see
the hangs on a Sept. 7 kernel from head and I do see them
on a Dec. 3 kernel from head. (Don't know the exact rNN.)

It seems to predate my commit (r244008), which was my first
concern.

I use old single core i386 hardware and can fairly reliably
reproduce it by doing a kernel build and a svn checkout
concurrently. No NFS activity. These are running on a local
disk (UFS/FFS). (The kernel I reproduce it on is built via
GENERIC for i386. If you want me to start a binary search
for which rNN, I can do that, but it will take a while.:-)

I can get out into DDB, but I'll admit I don't know enough
about it to know where to look;-)
Here's some lines from db ps, in case they give someone
useful information. (I can leave this box sitting in DB for
the rest of to-day, in case someone can suggest what I should
look for on it.)

Just snippets...
   Ss pause adjkerntz
   DL sdflush  [sofdepflush]
   RL[syncer]
   DL vlruwt   [vnlru]
   DL psleep   [bufdaemon]
   RL  [pagezero]
   DL psleep   [vmdaemon]
   DL psleep   [pagedaemon]
   DL ccb_scan [xpt_thrd]
   DL waiting_ [sctp_iterator]
   DL ctl_work [ctl_thrd]
   DL cooling  [acpi_cooling0]
   DL tzpoll   [acpi_thermal]
   DL (threaded) [usb]
   ...
   DL -[yarrow]
   DL (threaded) [geom]
   D  - [g_down]
   D  - [g_up]
   D  - [g_event]
   RL   (threaded) [intr]
   I[irq15: ata1]
   ...
   Run CPU0[swi6: Giant taskq]
-- does this one indicate the CPU is actually running this?
   (after a db cont, wait a while, ctrl-alt-esc, db ps
it is still the same)
   I[swi4: clock]
   I[swi1: netisr 0]
   I[swi3: vm]
   RL   [idle: cpu0]
   SLs wait [init]
   DL  audit_wo [audit]
   DLs (threaded) [kernel]
   D  - [deadlkres]
   ...
   D   sched[swapper]

I have no idea if this ps output helps, unless it indicates
that it is looping on the Giant taskq?

As I said, I can leave it in db for to-day, if anyone wants
me to do anything in the debugger and I can probably reproduce
it, if someone wants stuff tried later.

rick


 
 
 adrian
 
 
 On 9 December 2012 22:08, Tim Kientzle kient...@freebsd.org wrote:
  I haven't found any useful clues yet, but thought I'd ask if anyone
  else
  was seeing hangs in a recent kernel.
 
  I just upgraded to r244036 using a straight GENERIC i386 kernel.
  (Straight buildworld/buildkernel, no local changes, /etc/src.conf
  doesn't
  exist, /etc/make.conf just has PERL_VERSION defined.)
 
  When I try to cross build an ARM world on the resulting system,
  the entire system hangs hard after about 30 minutes: No network,
  no keyboard response, no nothing.
 
  Don't know if it's relevant, but the system is using NFS pretty
  heavily (Parallels VM mounting NFS from Mac OS 10.7 host.)
 
  I'll try to get some more details ...
 
  Tim
 


Re: r244036 kernel hangs under load.

2012-12-10 Thread Konstantin Belousov
On Mon, Dec 10, 2012 at 01:38:21PM -0500, Rick Macklem wrote:
 Adrian Chadd wrote:
  .. what was the previous kernel version?
  
 Hopefully Tim has it narrowed down more, but I don't see
 the hangs on a Sept. 7 kernel from head and I do see them
 on a Dec. 3 kernel from head. (Don't know the exact rNN.)
 
 It seems to predate my commit (r244008), which was my first
 concern.
 
 I use old single core i386 hardware and can fairly reliably
 reproduce it by doing a kernel build and a svn checkout
 concurrently. No NFS activity. These are running on a local
 disk (UFS/FFS). (The kernel I reproduce it on is built via
 GENERIC for i386. If you want me to start a binary search
 for which rNN, I can do that, but it will take a while.:-)
 
 I can get out into DDB, but I'll admit I don't know enough
 about it to know where to look;-)
 Here's some lines from db ps, in case they give someone
 useful information. (I can leave this box sitting in DB for
 the rest of to-day, in case someone can suggest what I should
 look for on it.)
 
 Just snippets...
Ss pause adjkerntz
DL sdflush  [sofdepflush]
RL[syncer]
DL vlruwt   [vnlru]
DL psleep   [bufdaemon]
RL  [pagezero]
DL psleep   [vmdaemon]
DL psleep   [pagedaemon]
DL ccb_scan [xpt_thrd]
DL waiting_ [sctp_iterator]
DL ctl_work [ctl_thrd]
DL cooling  [acpi_cooling0]
DL tzpoll   [acpi_thermal]
DL (threaded) [usb]
...
DL -[yarrow]
DL (threaded) [geom]
D  - [g_down]
D  - [g_up]
D  - [g_event]
RL   (threaded) [intr]
I[irq15: ata1]
...
Run CPU0[swi6: Giant taskq]
 <-- does this one indicate the CPU is actually running this?
(after a db cont, waiting a while, then <ctrl><alt><esc> and db ps again,
 it is still the same)
I[swi4: clock]
I[swi1: netisr 0]
I[swi3: vm]
RL   [idle: cpu0]
SLs wait [init]
DL  audit_wo [audit]
DLs (threaded) [kernel]
D  - [deadlkres]
...
D   sched[swapper]
 
 I have no idea if this ps output helps, unless it indicates
 that it is looping on the Giant taskq?
Might be. You could do 'bt pid' for the process to see where it loops.
Another good set of hints is at
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
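
From memory (better to check the page itself), the usual sequence at the
db> prompt is roughly:

    db> ps                look for runnable threads and odd wchans
    db> show pcpu         what the CPU is executing right now
    db> show alllocks     lock owners and where each lock was taken
                          (needs a kernel built with WITNESS)
    db> show lockedvnods  vnodes whose vnode locks are held
    db> bt <pid>          stack of a suspect thread
    db> alltrace          stacks of all threads, if still in doubt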

 
 As I said, I can leave it in db for to-day, if anyone wants
 me to do anything in the debugger and I can probably reproduce
 it, if someone wants stuff tried later.
 
 rick
 
 
  
  
  adrian
  
  
  On 9 December 2012 22:08, Tim Kientzle kient...@freebsd.org wrote:
   I haven't found any useful clues yet, but thought I'd ask if anyone
   else
   was seeing hangs in a recent kernel.
  
   I just upgraded to r244036 using a straight GENERIC i386 kernel.
   (Straight buildworld/buildkernel, no local changes, /etc/src.conf
   doesn't
   exist, /etc/make.conf just has PERL_VERSION defined.)
  
   When I try to cross build an ARM world on the resulting system,
   the entire system hangs hard after about 30 minutes: No network,
   no keyboard response, no nothing.
  
   Don't know if it's relevant, but the system is using NFS pretty
   heavily (Parallels VM mounting NFS from Mac OS 10.7 host.)
  
   I'll try to get some more details ...
  
   Tim
  




Re: r244036 kernel hangs under load.

2012-12-10 Thread Rick Macklem
Konstantin Belousov wrote:
 On Mon, Dec 10, 2012 at 01:38:21PM -0500, Rick Macklem wrote:
  Adrian Chadd wrote:
   .. what was the previous kernel version?
  
  Hopefully Tim has it narrowed down more, but I don't see
  the hangs on a Sept. 7 kernel from head and I do see them
  on a Dec. 3 kernel from head. (Don't know the exact rNN.)
 
  It seems to predate my commit (r244008), which was my first
  concern.
 
  I use old single core i386 hardware and can fairly reliably
  reproduce it by doing a kernel build and a svn checkout
  concurrently. No NFS activity. These are running on a local
  disk (UFS/FFS). (The kernel I reproduce it on is built via
  GENERIC for i386. If you want me to start a binary search
  for which rNN, I can do that, but it will take a while.:-)
 
  I can get out into DDB, but I'll admit I don't know enough
  about it to know where to look;-)
  Here's some lines from db ps, in case they give someone
  useful information. (I can leave this box sitting in DB for
  the rest of to-day, in case someone can suggest what I should
  look for on it.)
 
  Just snippets...
 Ss pause adjkerntz
 DL sdflush [sofdepflush]
 RL [syncer]
 DL vlruwt [vnlru]
 DL psleep [bufdaemon]
 RL [pagezero]
 DL psleep [vmdaemon]
 DL psleep [pagedaemon]
 DL ccb_scan [xpt_thrd]
 DL waiting_ [sctp_iterator]
 DL ctl_work [ctl_thrd]
 DL cooling [acpi_cooling0]
 DL tzpoll [acpi_thermal]
 DL (threaded) [usb]
 ...
 DL - [yarrow]
 DL (threaded) [geom]
 D - [g_down]
 D - [g_up]
 D - [g_event]
 RL (threaded) [intr]
 I [irq15: ata1]
 ...
 Run CPU0 [swi6: Giant taskq]
  <-- does this one indicate the CPU is actually running this?
 (after a db cont, waiting a while, then <ctrl><alt><esc> and db ps again,
  it is still the same)
 I [swi4: clock]
 I [swi1: netisr 0]
 I [swi3: vm]
 RL [idle: cpu0]
 SLs wait [init]
 DL audit_wo [audit]
 DLs (threaded) [kernel]
 D - [deadlkres]
 ...
 D sched [swapper]
 
  I have no idea if this ps output helps, unless it indicates
  that it is looping on the Giant taskq?
 Might be. You could do 'bt pid' for the process to see where it
 loops.
 Another good set of hints is at
 http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html

Kostik, you must be clairvoyant;-)

When I did show alllocks, I found that the syncer process held
- exclusive sleep mutex mount mtx locked @ kern/vfs_subr.c:4720
- exclusive lockmgr syncer locked @ kern/vfs_subr.c:1780
The trace for this process goes like:
 spinlock_exit()
 __mtx_unlock_spin_flags()
 kern_yield()
 __mnt_vnode_next_active()
 vfs_msync()

So, it seems like your r244095 commit might have fixed this?
(I'm not good at this stuff, but from your description, it looks
 like it did the kern_yield() with the mutex held and maybe
 got into trouble trying to acquire Giant?)
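
To make sure I'm reading it right, here is a rough C sketch of the
pattern I mean (hypothetical function names, macro/field names from
memory, not the actual vfs_subr.c code):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/mount.h>
#include <sys/vnode.h>

/*
 * Sketch of one step of an "iterate the active vnode list" loop.  The
 * suspect pattern: the vnode interlock is only try-locked, and on
 * failure the thread yields while still owning the mount interlock.
 * If the owner of that vnode interlock is itself waiting for the
 * mount interlock, yielding never helps and this loop spins forever.
 */
static struct vnode *
next_active_sketch(struct mount *mp, struct vnode *vp)
{
        MNT_ILOCK(mp);                  /* mount interlock held */
        while ((vp = TAILQ_NEXT(vp, v_actfreelist)) != NULL) {
                if (VI_TRYLOCK(vp))     /* got the vnode interlock */
                        break;
                /* Suspect: yield while still owning the mount interlock. */
                kern_yield(PRI_UNCHANGED);
        }
        MNT_IUNLOCK(mp);
        return (vp);
}

/*
 * A shape that would avoid it (whether this is what r244095 actually
 * does, I don't know): drop the mount interlock around the yield so
 * the blocked thread can take it, finish, and release the vnode
 * interlock.  The real iterator would need a marker vnode to find its
 * place again after dropping the lock; that is omitted here.
 */
static struct vnode *
next_active_fixed_sketch(struct mount *mp, struct vnode *vp)
{
        MNT_ILOCK(mp);
        while ((vp = TAILQ_NEXT(vp, v_actfreelist)) != NULL) {
                if (VI_TRYLOCK(vp))
                        break;
                MNT_IUNLOCK(mp);        /* drop before yielding */
                kern_yield(PRI_UNCHANGED);
                MNT_ILOCK(mp);          /* reacquire and retry */
        }
        MNT_IUNLOCK(mp);
        return (vp);
}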

Anyhow, I'm going to test a kernel with r244095 in it and see
if I can still reproduce the hang.
(There wasn't much else in the show alllocks, except a
 process that held the exclusive vnode interlock mutex plus
 a ufs vnode lock, but it's just doing a witness_unlock.)

I'll email if/when I know more, rick
ps: Fingers/toes crossed that you've already fixed it.

 
 
  As I said, I can leave it in db for to-day, if anyone wants
  me to do anything in the debugger and I can probably reproduce
  it, if someone wants stuff tried later.
 
  rick
 
 
  
  
   adrian
  
  
   On 9 December 2012 22:08, Tim Kientzle kient...@freebsd.org
   wrote:
I haven't found any useful clues yet, but thought I'd ask if
anyone
else
was seeing hangs in a recent kernel.
   
I just upgraded to r244036 using a straight GENERIC i386 kernel.
(Straight buildworld/buildkernel, no local changes,
/etc/src.conf
doesn't
exist, /etc/make.conf just has PERL_VERSION defined.)
   
When I try to cross build an ARM world on the resulting system,
the entire system hangs hard after about 30 minutes: No network,
no keyboard response, no nothing.
   
Don't know if it's relevant, but the system is using NFS pretty
heavily (Parallels VM mounting NFS from Mac OS 10.7 host.)
   
I'll try to get some more details ...
   
Tim
   

Re: r244036 kernel hangs under load.

2012-12-10 Thread Konstantin Belousov
On Mon, Dec 10, 2012 at 07:11:59PM -0500, Rick Macklem wrote:
 Konstantin Belousov wrote:
  On Mon, Dec 10, 2012 at 01:38:21PM -0500, Rick Macklem wrote:
   Adrian Chadd wrote:
.. what was the previous kernel version?
   
   Hopefully Tim has it narrowed down more, but I don't see
   the hangs on a Sept. 7 kernel from head and I do see them
   on a Dec. 3 kernel from head. (Don't know the exact rNN.)
  
   It seems to predate my commit (r244008), which was my first
   concern.
  
   I use old single core i386 hardware and can fairly reliably
   reproduce it by doing a kernel build and a svn checkout
   concurrently. No NFS activity. These are running on a local
   disk (UFS/FFS). (The kernel I reproduce it on is built via
   GENERIC for i386. If you want me to start a binary search
   for which rNN, I can do that, but it will take a while.:-)
  
   I can get out into DDB, but I'll admit I don't know enough
   about it to know where to look;-)
   Here's some lines from db ps, in case they give someone
   useful information. (I can leave this box sitting in DB for
   the rest of to-day, in case someone can suggest what I should
   look for on it.)
  
   Just snippets...
  Ss pause adjkerntz
  DL sdflush [sofdepflush]
  RL [syncer]
  DL vlruwt [vnlru]
  DL psleep [bufdaemon]
  RL [pagezero]
  DL psleep [vmdaemon]
  DL psleep [pagedaemon]
  DL ccb_scan [xpt_thrd]
  DL waiting_ [sctp_iterator]
  DL ctl_work [ctl_thrd]
  DL cooling [acpi_cooling0]
  DL tzpoll [acpi_thermal]
  DL (threaded) [usb]
  ...
  DL - [yarrow]
  DL (threaded) [geom]
  D - [g_down]
  D - [g_up]
  D - [g_event]
  RL (threaded) [intr]
  I [irq15: ata1]
  ...
  Run CPU0 [swi6: Giant taskq]
   <-- does this one indicate the CPU is actually running this?
  (after a db cont, waiting a while, then <ctrl><alt><esc> and db ps again,
   it is still the same)
  I [swi4: clock]
  I [swi1: netisr 0]
  I [swi3: vm]
  RL [idle: cpu0]
  SLs wait [init]
  DL audit_wo [audit]
  DLs (threaded) [kernel]
  D - [deadlkres]
  ...
  D sched [swapper]
  
   I have no idea if this ps output helps, unless it indicates
   that it is looping on the Giant taskq?
  Might be. You could do 'bt pid' for the process to see where it
  loops.
  Another good set of hints is at
  http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
 
 Kostik, you must be clairvoyant;-)
 
 When I did show alllocks, I found that the syncer process held
 - exclusive sleep mutex mount mtx locked @ kern/vfs_subr.c:4720
 - exclusive lockmgr syncer locked @ kern/vfs_subr.c:1780
 The trace for this process goes like:
  spinlock_exit
  mtx_unlock_spin_flags
  kern_yield
  _mnt_vnode_next_active
  vnode_next_active
  vfs_msync()
 
 So, it seems like your r244095 commit might have fixed this?
 (I'm not good at this stuff, but from your description, it looks
  like it did the kern_yield() with the mutex held and maybe
  got into trouble trying to acquire Giant?)
 
 Anyhow, I'm going to test a kernel with r244095 in it and see
 if I can still reproduce the hang.
 (There wasn't much else in the show alllocks, except a
  process that held the exclusive vnode interlock mutex plus
  a ufs vnode lock, but it's just doing a witness_unlock.)
There must be a thread blocked on the mount interlock for the loop
in mnt_vnode_next_active to cause a livelock.
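
Schematically (hypothetical names, not the real code), the blocked
thread looks like this: it already owns a vnode interlock and then
parks on the mount interlock that the syncer never drops, so the
syncer's VI_TRYLOCK() can never succeed:

#include <sys/param.h>
#include <sys/mount.h>
#include <sys/vnode.h>

/*
 * Sketch only: a thread deactivating a vnode.  It owns the vnode
 * interlock and then needs the mount interlock to take the vnode off
 * the mount's active vnode list.  With the syncer owning the mount
 * interlock and merely yielding, this thread never runs again, the
 * vnode interlock stays held, and neither side makes progress.
 */
static void
deactivate_sketch(struct mount *mp, struct vnode *vp)
{
        VI_LOCK(vp);            /* vnode interlock owned */
        MNT_ILOCK(mp);          /* blocks here, waiting on the syncer */
        /* ... remove vp from mp's active vnode list ... */
        MNT_IUNLOCK(mp);
        VI_UNLOCK(vp);
}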

 
 I'll email if/when I know more, rick
 ps: Fingers/toes crossed that you've already fixed it.
 
  
  
   As I said, I can leave it in db for to-day, if anyone wants
   me to do anything in the debugger and I can probably reproduce
   it, if someone wants stuff tried later.
  
   rick
  
  
   
   
adrian
   
   
On 9 December 2012 22:08, Tim Kientzle kient...@freebsd.org
wrote:
 I haven't found any useful clues yet, but thought I'd ask if
 anyone
 else
 was seeing hangs in a recent kernel.

 I just upgraded to r244036 using a straight GENERIC i386 kernel.
 (Straight buildworld/buildkernel, no local changes,
 /etc/src.conf
 doesn't
 exist, /etc/make.conf just has PERL_VERSION defined.)

 When I try to cross build an ARM world on the resulting system,
 the entire system hangs hard after about 30 minutes: No network,
 no keyboard response, no nothing.

 Don't know if it's relevant, but the system is using NFS pretty
 heavily (Parallels VM mounting NFS from Mac OS 10.7 host.)

 I'll try to get some more details ...

 Tim


Re: r244036 kernel hangs under load.

2012-12-10 Thread Tim Kientzle

On Dec 10, 2012, at 10:38 AM, Rick Macklem wrote:

 Adrian Chadd wrote:
 .. what was the previous kernel version?
 
 Hopefully Tim has it narrowed down more, but I don't see
 the hangs on a Sept. 7 kernel from head and I do see them
 on a Dec. 3 kernel from head. (Don't know the exact rNN.)

I haven't had a chance to narrow it down any yet, but I think
my previous kernel was from around Aug 25.

Tim

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


r244036 kernel hangs under load.

2012-12-09 Thread Tim Kientzle
I haven't found any useful clues yet, but thought I'd ask if anyone else
was seeing hangs in a recent kernel.

I just upgraded to r244036 using a straight GENERIC i386 kernel.
(Straight buildworld/buildkernel, no local changes, /etc/src.conf doesn't
exist, /etc/make.conf just has PERL_VERSION defined.)

When I try to cross build an ARM world on the resulting system,
the entire system hangs hard after about 30 minutes:  No network,
no keyboard response, no nothing.

Don't know if it's relevant, but the system is using NFS pretty
heavily (Parallels VM mounting NFS from Mac OS 10.7 host.)

I'll try to get some more details ...

Tim

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org