Re: page fault in _mtx_lock_flags

2002-04-29 Thread John Baldwin


On 29-Apr-2002 Robert Watson wrote:
 
 If I apply the attached diff to kern_malloc.c, backing out a portion
 of kern_malloc.c:1.99, the rate of panics plummets.  Previously, I could
 have a box panic within five minutes of getting the crash boxes spinning. 
 Now I've been going for about 40 minutes without any perceived failures
 (i.e., no panics).  I have no idea why this fixes the problem, but David
 Wolfskill pointed me at that particular revision as being a source of
 related problems for him.  I'm going to leave the boxes running overnight
 and see what I bump into.  It would be nice to know if this is masking the
 problem, or fixing the problem, and if so, why. 

It looks like you have memory corruption.  I think the patch adds new buckets
of larger sizes.  Perhaps the problem is a bug in UMA: someone allocates
something bigger than the largest bucket, and the chunk they get back is only
the size of an item in the largest bucket, so when the code writes to the end
of the structure it overwrites other memory.
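
The failure mode hypothesized above can be sketched in a few lines of C.
This is only an illustration under stated assumptions, not the actual UMA or
kern_malloc.c code: the bucket table and the names buggy_item_size() and
overrun_bytes() are hypothetical.  It shows how a lookup that silently falls
back to the largest size class hands out an item smaller than the request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical power-of-two size classes; not the real kern_malloc table. */
static const size_t bucket_sizes[] = { 16, 32, 64, 128, 256, 512, 1024 };
#define NBUCKETS (sizeof(bucket_sizes) / sizeof(bucket_sizes[0]))

/*
 * Buggy lookup: returns the size of the item actually handed back.
 * A request larger than the largest class silently falls through to
 * the largest class instead of being rejected or routed elsewhere.
 */
static size_t
buggy_item_size(size_t request)
{
	size_t i;

	for (i = 0; i < NBUCKETS; i++)
		if (request <= bucket_sizes[i])
			return (bucket_sizes[i]);
	return (bucket_sizes[NBUCKETS - 1]);	/* the hypothesized bug */
}

/* Bytes a caller would write past the end of the item it received. */
static size_t
overrun_bytes(size_t request)
{
	size_t got = buggy_item_size(request);

	return (request > got ? request - got : 0);
}
```

With this table, a 100-byte request fits in the 128-byte class and overruns
nothing, while a 1500-byte request gets a 1024-byte item, so a caller filling
its whole structure would scribble over the 476 bytes that follow.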

-- 

John Baldwin [EMAIL PROTECTED]  http://www.FreeBSD.org/~jhb/
Power Users Use the Power to Serve!  -  http://www.FreeBSD.org/

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-current in the body of the message



Re: page fault in _mtx_lock_flags

2002-04-29 Thread Robert Watson


On Mon, 29 Apr 2002, John Baldwin wrote:

 On 29-Apr-2002 Robert Watson wrote:
  
  If I apply the attached diff to kern_malloc.c, backing out a portion
  of kern_malloc.c:1.99, the rate of panics plummets.  Previously, I could
  have a box panic within five minutes of getting the crash boxes spinning. 
  Now I've been going for about 40 minutes without any perceived failures
  (i.e., no panics).  I have no idea why this fixes the problem, but David
  Wolfskill pointed me at that particular revision as being a source of
  related problems for him.  I'm going to leave the boxes running overnight
  and see what I bump into.  It would be nice to know if this is masking the
  problem, or fixing the problem, and if so, why. 
 
 It looks like you have memory corruption.  I think the patch adds new
 buckets of larger sizes.  Perhaps the problem is a bug in UMA: someone
 allocates something bigger than the largest bucket, and the chunk they
 get back is only the size of an item in the largest bucket, so when the
 code writes to the end of the structure it overwrites other memory.

That was what I was theorizing when I made the change, but I haven't
really had much time lately to read the UMA code, so it's Greek to me. :-)

Robert N M Watson FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]  NAI Labs, Safeport Network Services






Re: page fault in _mtx_lock_flags

2002-04-29 Thread Dag-Erling Smorgrav

John Baldwin [EMAIL PROTECTED] writes:
 On 28-Apr-2002 Robert Watson wrote:
  db> trace
  _mtx_lock_flags(79747473,0,c03cb862,e3) at _mtx_lock_flags+0x42
 
 Same here.  See the first arg which is supposed to be a mutex pointer.
 
 ytts

stty, actually, since the i386 is little-endian.
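
The decoding is easy to check mechanically.  The sketch below is an
illustration, not anything from the kernel; decode_le32() is a hypothetical
helper.  It lists the bytes of a 32-bit value in the order a little-endian
machine such as the i386 stores them in memory, lowest byte first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Write the four bytes of `value` into `out` in little-endian memory
 * order (least significant byte first) and NUL-terminate the result.
 * `out` must have room for five bytes.
 */
static void
decode_le32(uint32_t value, char out[5])
{
	int i;

	for (i = 0; i < 4; i++)
		out[i] = (char)((value >> (8 * i)) & 0xff);
	out[4] = '\0';
}
```

Applied to the bogus mutex pointer 0x79747473 from the sh trace this yields
"stty"; applied to 0x6b736964 from the mdconfig trace it yields "disk".  Both
values read as ASCII text, which fits the idea that string data was written
over the structure holding the pointer.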

DES
-- 
Dag-Erling Smorgrav - [EMAIL PROTECTED]




Re: page fault in _mtx_lock_flags

2002-04-28 Thread Robert Watson

I also get an almost identical fault on crash1 involving mdconfig as
opposed to sh:

ray irq 10
NFS ROOT: 192.168.50.1:/cboss/devel/nfsroot/crash1.cboss.tislabs.com
8.50.10 BroadcasP-Address 192.16
t 192.168.50.255

Fatal trap 12: page fault while in kernel mode
cpuid = 1; lapic.id = 0100
fault virtual address   = 0x6b73697c
fault code              = supervisor write, page not present
instruction pointer     = 0x8:0xc02449b6
stack pointer           = 0x10:0xc93d8a14
frame pointer           = 0x10:0xc93d8a20
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 44 (mdconfig)
kernel: type 12 trap, code=0
Stopped at  _mtx_lock_flags+0x42:   lock cmpxchgl   %ecx,0x18(%ebx)
db> trace
_mtx_lock_flags(6b736964,0,c03cb862,e3) at _mtx_lock_flags+0x42
lockmgr(c93a8228,101,0,c8f27100) at lockmgr+0x42
vfs_busy(c93a8200,0,0,c8f27100) at vfs_busy+0x58
lookup(c93d8c28,0,c93b9c34,c93d8d20,c8f27100) at lookup+0x3a2
namei(c93d8c28,0,c93b9c34,c93d8d20,0) at namei+0x1c8
vn_open_cred(c93d8c28,c93d8bf4,0,c3f80c80,c93d8ce8) at vn_open_cred+0x23b
vn_open(c93d8c28,c93d8bf4,0,c8f271dc,c8f27000) at vn_open+0x18
open(c8f27100,c93d8d20,0,0,0) at open+0x158
syscall(2f,2f,2f,0,0) at syscall+0x223
syscall_with_err_pushed() at syscall_with_err_pushed+0x1b
--- syscall (5, FreeBSD ELF, open), eip = 0x804950b, esp = 0xbfbffd14, ebp
= 0xbfbffd50 ---
db> Context switches not allowed in the debugger.
db>

Still not clear what the origin of this is -- possibly memory corruption
of the mutex..?
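
One cross-check that the first argument really is the bad pointer: the
faulting instruction is "lock cmpxchgl %ecx,0x18(%ebx)", so the fault address
should be the pointer plus 0x18.  The sketch below does that arithmetic.  It
assumes that %ebx holds the mutex pointer passed as the first argument, and
expected_fault_addr() is a hypothetical helper, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Displacement taken from the faulting instruction: 0x18(%ebx). */
#define CMPXCHG_DISPLACEMENT 0x18u

/*
 * Address the cmpxchgl would touch if the base register held `mtx_ptr`.
 * 32-bit arithmetic, matching the i386 address size.
 */
static uint32_t
expected_fault_addr(uint32_t mtx_ptr)
{
	return (mtx_ptr + CMPXCHG_DISPLACEMENT);
}
```

For the mdconfig trace, 0x6b736964 + 0x18 is 0x6b73697c, and for the sh trace,
0x79747473 + 0x18 is 0x7974748b.  Both match the reported fault virtual
addresses, so the fault is the write to the lock word through the garbage
pointer rather than something later in _mtx_lock_flags.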


Robert N M Watson FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]  NAI Labs, Safeport Network Services

On Sun, 28 Apr 2002, Robert Watson wrote:

 
 As usual, GENERIC -CURRENT head from last night, from the main tree. 
 Dual-proc SMP box netbooted using PXE.  System usually boots, does a
 buildkernel -j 8 over NFS, then reboots and repeats.  This time it didn't. 
 
 I actually have two boxes doing this, which does seem to double the rate
 of panics I get.
 
 APIC_IO: Testing 8254 interrupt delivery
 APIC_IO: Broken MP table detected: 8254 is not connected to IOAPIC #0 intpin 2
 APIC_IO: routing 8254 via 8259 and IOAPIC #0 intpin 0
 ad0: 19458MB ST320420A [39535/16/63] at ata0-master UDMA33
 acd0: CDROM MATSHITA CR-176 at ata1-master PIO4
 doSuMnPt:i nAgP  rCoPoUt  #f1r oLma unnfcsh:etsray irq 10
 NFS ROOT: 192.168.50.1:/cboss/devel/nfsroot/crash1.cboss.tislabs.com
 
 
 Fatal trap 12: page fault while in kernel mode
 cpuid = 0; lapic.id = 
 fault virtual address   = 0x7974748b
 fault code              = supervisor write, page not present
 instruction pointer     = 0x8:0xc02449b6
 stack pointer           = 0x10:0xc93dea14
 frame pointer           = 0x10:0xc93dea20
 code segment            = base 0x0, limit 0xf, type 0x1b
                         = DPL 0, pres 1, def32 1, gran 1
 processor eflags        = interrupt enabled, resume, IOPL = 0
 current process         = 41 (sh)
 kernel: type 12 trap, code=0
 Stopped at  _mtx_lock_flags+0x42:   lock cmpxchgl   %ecx,0x18(%ebx)
 db> trace
 _mtx_lock_flags(79747473,0,c03cb862,e3) at _mtx_lock_flags+0x42
 lockmgr(c93a8228,101,0,c8f27100) at lockmgr+0x42
 vfs_busy(c93a8200,0,0,c8f27100) at vfs_busy+0x58
 lookup(c93dec28,1a4,c8f03034,c93ded20,c8f27100) at lookup+0x3a2
 namei(c93dec28,1a4,c8f03034,c93ded20,0) at namei+0x1c8
 vn_open_cred(c93dec28,c93debf4,1a4,c3f80c80,c93dece8) at vn_open_cred+0x67
 vn_open(c93dec28,c93debf4,1a4,c8f271dc,c8f27000) at vn_open+0x18
 open(c8f27100,c93ded20,8125005,0,0) at open+0x158
 syscall(2f,2f,2f,0,0) at syscall+0x223
 syscall_with_err_pushed() at syscall_with_err_pushed+0x1b
 --- syscall (5, FreeBSD ELF, open), eip = 0x808969b, esp = 0xbfbff8f0, ebp
 = 0xbfbff91c ---
 db>
 
 (kgdb) l *_mtx_lock_flags+0x42
 0xc02449b6 is in _mtx_lock_flags (machine/atomic.h:139).
 134 static __inline int
 135 atomic_cmpset_int(volatile u_int *dst, u_int exp, u_int src)
 136 {
 137 int res = exp;
 138
 139 __asm __volatile (
 140 __XSTRING(MPLOCKED)
 141 cmpxchgl %1,%2 ;
 142 setz %%al ;
 143 movzbl %%al,%0 ;
 (gdb) l *lockmgr+0x42
 0xc0242376 is in lockmgr (../../../kern/kern_lock.c:228).
 223 pid = LK_KERNPROC;
 224 else
 225 pid = td->td_proc->p_pid;
 226
 227 mtx_lock(lkp->lk_interlock);
 228 if (flags & LK_INTERLOCK) {
 229 mtx_assert(interlkp, MA_OWNED | MA_NOTRECURSED);
 230 mtx_unlock(interlkp);
 231 }
 232
 
 Attempts to get into serial gdb failed:
 
 Fatal trap 12: page fault while in kernel mode
 cpuid = 1; lapic.id = 0100
 fault virtual address   = 0x6aa
 fault code  = supervisor read, page not present
 

Re: page fault in _mtx_lock_flags

2002-04-28 Thread Robert Watson


If I apply the attached diff to kern_malloc.c, backing out a portion
of kern_malloc.c:1.99, the rate of panics plummets.  Previously, I could
have a box panic within five minutes of getting the crash boxes spinning. 
Now I've been going for about 40 minutes without any perceived failures
(i.e., no panics).  I have no idea why this fixes the problem, but David
Wolfskill pointed me at that particular revision as being a source of
related problems for him.  I'm going to leave the boxes running overnight
and see what I bump into.  It would be nice to know if this is masking the
problem, or fixing the problem, and if so, why. 

Robert N M Watson FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]  NAI Labs, Safeport Network Services

On Sun, 28 Apr 2002, Robert Watson wrote:

 I also get an almost identical fault on crash1 involving mdconfig as
 opposed to sh:
 
 ray irq 10
 NFS ROOT: 192.168.50.1:/cboss/devel/nfsroot/crash1.cboss.tislabs.com
 8.50.10 BroadcasP-Address 192.16
 t 192.168.50.255
 
 Fatal trap 12: page fault while in kernel mode
 cpuid = 1; lapic.id = 0100
 fault virtual address   = 0x6b73697c
 fault code              = supervisor write, page not present
 instruction pointer     = 0x8:0xc02449b6
 stack pointer           = 0x10:0xc93d8a14
 frame pointer           = 0x10:0xc93d8a20
 code segment            = base 0x0, limit 0xf, type 0x1b
                         = DPL 0, pres 1, def32 1, gran 1
 processor eflags        = interrupt enabled, resume, IOPL = 0
 current process         = 44 (mdconfig)
 kernel: type 12 trap, code=0
 Stopped at  _mtx_lock_flags+0x42:   lock cmpxchgl   %ecx,0x18(%ebx)
 db> trace
 _mtx_lock_flags(6b736964,0,c03cb862,e3) at _mtx_lock_flags+0x42
 lockmgr(c93a8228,101,0,c8f27100) at lockmgr+0x42
 vfs_busy(c93a8200,0,0,c8f27100) at vfs_busy+0x58
 lookup(c93d8c28,0,c93b9c34,c93d8d20,c8f27100) at lookup+0x3a2
 namei(c93d8c28,0,c93b9c34,c93d8d20,0) at namei+0x1c8
 vn_open_cred(c93d8c28,c93d8bf4,0,c3f80c80,c93d8ce8) at vn_open_cred+0x23b
 vn_open(c93d8c28,c93d8bf4,0,c8f271dc,c8f27000) at vn_open+0x18
 open(c8f27100,c93d8d20,0,0,0) at open+0x158
 syscall(2f,2f,2f,0,0) at syscall+0x223
 syscall_with_err_pushed() at syscall_with_err_pushed+0x1b
 --- syscall (5, FreeBSD ELF, open), eip = 0x804950b, esp = 0xbfbffd14, ebp
 = 0xbfbffd50 ---
 db> Context switches not allowed in the debugger.
 db>
 
 Still not clear what the origin of this is -- possibly memory corruption
 of the mutex..?
 
 
 Robert N M Watson FreeBSD Core Team, TrustedBSD Project
 [EMAIL PROTECTED]  NAI Labs, Safeport Network Services
 
 On Sun, 28 Apr 2002, Robert Watson wrote:
 
  
  As usual, GENERIC -CURRENT head from last night, from the main tree. 
  Dual-proc SMP box netbooted using PXE.  System usually boots, does a
  buildkernel -j 8 over NFS, then reboots and repeats.  This time it didn't. 
  
  I actually have two boxes doing this, which does seem to double the rate
  of panics I get.
  
  APIC_IO: Testing 8254 interrupt delivery
  APIC_IO: Broken MP table detected: 8254 is not connected to IOAPIC #0 intpin 2
  APIC_IO: routing 8254 via 8259 and IOAPIC #0 intpin 0
  ad0: 19458MB ST320420A [39535/16/63] at ata0-master UDMA33
  acd0: CDROM MATSHITA CR-176 at ata1-master PIO4
  doSuMnPt:i nAgP  rCoPoUt  #f1r oLma unnfcsh:etsray irq 10
  NFS ROOT: 192.168.50.1:/cboss/devel/nfsroot/crash1.cboss.tislabs.com
  
  
  Fatal trap 12: page fault while in kernel mode
  cpuid = 0; lapic.id = 
  fault virtual address   = 0x7974748b
  fault code              = supervisor write, page not present
  instruction pointer     = 0x8:0xc02449b6
  stack pointer           = 0x10:0xc93dea14
  frame pointer           = 0x10:0xc93dea20
  code segment            = base 0x0, limit 0xf, type 0x1b
                          = DPL 0, pres 1, def32 1, gran 1
  processor eflags        = interrupt enabled, resume, IOPL = 0
  current process         = 41 (sh)
  kernel: type 12 trap, code=0
  Stopped at  _mtx_lock_flags+0x42:   lock cmpxchgl   %ecx,0x18(%ebx)
  db> trace
  _mtx_lock_flags(79747473,0,c03cb862,e3) at _mtx_lock_flags+0x42
  lockmgr(c93a8228,101,0,c8f27100) at lockmgr+0x42
  vfs_busy(c93a8200,0,0,c8f27100) at vfs_busy+0x58
  lookup(c93dec28,1a4,c8f03034,c93ded20,c8f27100) at lookup+0x3a2
  namei(c93dec28,1a4,c8f03034,c93ded20,0) at namei+0x1c8
  vn_open_cred(c93dec28,c93debf4,1a4,c3f80c80,c93dece8) at vn_open_cred+0x67
  vn_open(c93dec28,c93debf4,1a4,c8f271dc,c8f27000) at vn_open+0x18
  open(c8f27100,c93ded20,8125005,0,0) at open+0x158
  syscall(2f,2f,2f,0,0) at syscall+0x223
  syscall_with_err_pushed() at syscall_with_err_pushed+0x1b
  --- syscall (5, FreeBSD ELF, open), eip = 0x808969b, esp = 0xbfbff8f0, ebp
  = 0xbfbff91c ---
  db>
  
  (kgdb) l *_mtx_lock_flags+0x42
  0xc02449b6 is in _mtx_lock_flags (machine/atomic.h:139).
  134 static __inline int
  135 atomic_cmpset_int(volatile u_int *dst, u_int exp, u_int src)
  136 {
  137 int