Re: [vfs_bio] Re: Fatal trap 12: page fault while in kernel mode (with potential cause)

2007-06-24 Thread Adam McDougall
On Sun, Jun 24, 2007 at 12:30:20AM -0400, Adam McDougall wrote:

  On Mon, Apr 23, 2007 at 11:55:52AM -0400, Kris Kennaway wrote:
  
On Mon, Apr 23, 2007 at 05:35:47PM +0200, Kai wrote:
 On Thu, Apr 19, 2007 at 02:33:29PM +0200, Kai wrote:
  On Wed, Apr 11, 2007 at 12:53:32PM +0200, Kai wrote:
   
   Hello all,
   
   We're running into regular panics on our webserver after upgrading
   from 4.x to 6.2-stable:
  
 
 Hi all,
 
 To continue this story, a colleague wrote a small program in C that launches
 40 threads to randomly append and write to 10 files on an NFS mounted
 filesystem. 
 
 If I keep removing the files on one of the other machines in a while loop,
 the first system panics:
 
 Fatal trap 12: page fault while in kernel mode
 cpuid = 1; apic id = 01
 fault virtual address   = 0x34
 fault code  = supervisor read, page not present
 instruction pointer = 0x20:0xc06bdefa
 stack pointer   = 0x28:0xeb9f69b8
 frame pointer   = 0x28:0xeb9f69c4
 code segment= base 0x0, limit 0xf, type 0x1b
 = DPL 0, pres 1, def32 1, gran 1
 processor eflags= interrupt enabled, resume, IOPL = 0
 current process = 73626 (nfscrash)
 trap number = 12
 panic: page fault
 cpuid = 1
 Uptime: 3h2m14s
 
 Sounds like a nice denial of service problem. I can hand the program to
 developers on request.
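
For readers without access to that program, here is a minimal sketch of what such a stress test might look like. It is not the original nfscrash source. The thread and file counts come from the description above; everything else (the names run_stress and worker, the stress.N file names, the 4 KB buffer, the iteration count) is an assumption made for illustration.

```c
/* Hypothetical sketch of an NFS write stress test; NOT the original nfscrash. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NFILES   10   /* files shared by all threads (from the post) */
#define NTHREADS 40   /* maximum number of writer threads (from the post) */

struct worker_arg {
    const char *dir;      /* directory to write in, e.g. on an NFS mount */
    unsigned int seed;    /* per-thread state for rand_r() */
    int iterations;       /* number of write attempts (assumed parameter) */
    long completed;       /* writes that fully succeeded */
};

/* Each thread repeatedly picks a random file and either appends or overwrites. */
static void *worker(void *p)
{
    struct worker_arg *a = p;
    char path[1024], buf[4096];

    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < a->iterations; i++) {
        int n = rand_r(&a->seed) % NFILES;
        int flags = O_WRONLY | O_CREAT;
        size_t len = 1 + (size_t)rand_r(&a->seed) % sizeof(buf);

        if (rand_r(&a->seed) % 2)
            flags |= O_APPEND;          /* append; otherwise write from offset 0 */
        snprintf(path, sizeof(path), "%s/stress.%d", a->dir, n);

        int fd = open(path, flags, 0644);
        if (fd < 0)
            continue;                   /* e.g. directory missing or stale handle */
        if (write(fd, buf, len) == (ssize_t)len)
            a->completed++;
        close(fd);
    }
    return NULL;
}

/* Starts up to NTHREADS workers in dir and returns the total successful writes. */
long run_stress(const char *dir, int nthreads, int iterations)
{
    pthread_t tid[NTHREADS];
    struct worker_arg args[NTHREADS];
    long total = 0;
    int started = 0;

    if (nthreads > NTHREADS)
        nthreads = NTHREADS;
    for (int i = 0; i < nthreads; i++) {
        args[i] = (struct worker_arg){ dir, (unsigned int)i + 1, iterations, 0 };
        if (pthread_create(&tid[i], NULL, worker, &args[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += args[i].completed;
    }
    return total;
}
```

Calling run_stress("/mnt/nfs/test", 40, 100000) from main() on the client, while another machine removes the stress.* files in a loop, would approximate the scenario described. The mount path is an example only, and whether this sketch triggers the same panic is untested.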

Please send it to me.  Panics are always much easier to get fixed if
they come with a test case that a developer can use to reproduce it.

Kris
  
  I have been working on this problem all weekend, and at this point I have a
  strong hunch that it is a result of revision 1.424 of sys/kern/vfs_bio.c,
  which was committed between FreeBSD 5.1 and 5.2.  This hunch is currently
  being tested on a system that was cvsupped to the code just before 1.424;
  it has been running about 7 times longer than the usual time required to
  crash.  I am currently attempting to craft a patch for 6.2 that essentially
  backs out the change to see if that works, but if this information can help
  send a FreeBSD developer down the right trail to a proper fix, great.  I
  will follow up with more detailed findings and results tonight or soon.
  
  links:
  
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.423;r2=1.424
  related to 1.424:
  
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.420r2=1.421
  
  Commit emails:
  http://docs.freebsd.org/cgi/mid.cgi?200311150845.hAF8jawU027349
  http://docs.freebsd.org/cgi/mid.cgi?20030445.hAB4jbYw093253

If I turn on INVARIANTS, I get the following panic instead, much more quickly,
and it happens at least as far back as 5.0-RELEASE:

panic: bundirty: buffer 0x8e2e95f8 still on queue 1
cpuid = 1
Uptime: 35s
Dumping 511 MB (2 chunks)
  chunk 0: 1MB (153 pages) ... ok
  chunk 1: 511MB (130816 pages) 496 480 464 448 432 416 400 384 368 352 336 320 304 288 272 256 240 224 208 192 176 160 144 128 112 96 80 64 48 32 16

#0  doadump () at pcpu.h:172
172 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) bt
#0  doadump () at pcpu.h:172
#1  0x8028d699 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
#2  0x8028d12b in panic (fmt=0x80443458 bundirty: buffer %p still on queue %d)
    at /usr/src/sys/kern/kern_shutdown.c:565
#3  0x802e1e78 in bundirty (bp=0x8e2e95f8) at /usr/src/sys/kern/vfs_bio.c:1055
#4  0x802e3eb1 in brelse (bp=0x8e2e95f8) at /usr/src/sys/kern/vfs_bio.c:1370
#5  0x803550e8 in nfs_writebp (bp=0x8e2e95f8, force=0, td=0x0)
    at /usr/src/sys/nfsclient/nfs_vnops.c:3005
#6  0x802e5197 in getblk (vp=0xff000c23e5d0, blkno=0, size=14400, slpflag=256, slptimeo=0, flags=0)
    at buf.h:412
#7  0x80344f13 in nfs_getcacheblk (vp=0xff000c23e5d0, bn=0, size=14400, td=0xff0015b274c0)
    at /usr/src/sys/nfsclient/nfs_bio.c:1252
#8  0x8034616c in nfs_write (ap=0x0) at /usr/src/sys/nfsclient/nfs_bio.c:1068
#9  0x80405ee4 in VOP_WRITE_APV (vop=0x805a0260, a=0x976bfa10) at vnode_if.c:698
#10 0x80303d2c in vn_write (fp=0xff000f524000, uio=0x976bfb50, active_cred=0x0, flags=0,
    td=0xff0015b274c0) at vnode_if.h:372
#11 0x802ba2e5 in dofilewrite (td=0xff0015b274c0, fd=3, fp=0xff000f524000, auio=0x976bfb50,
    offset=0, flags=0) at file.h:253
#12 0x802ba5e1 in kern_writev (td=0xff0015b274c0, fd=3, auio=0x976bfb50)
    at /usr/src/sys/kern/sys_generic.c:402
#13 0x802ba6da in write (td=0x0, uap=0x0) at /usr/src/sys/kern/sys_generic.c:326
#14 0x803c6db2 in syscall (frame=
  {tf_rdi = 3, tf_rsi = 140737488344336, 

[vfs_bio] Re: Fatal trap 12: page fault while in kernel mode (with potential cause, fix?)

2007-06-23 Thread Adam McDougall
On Mon, Apr 23, 2007 at 11:55:52AM -0400, Kris Kennaway wrote:

  On Mon, Apr 23, 2007 at 05:35:47PM +0200, Kai wrote:
   On Thu, Apr 19, 2007 at 02:33:29PM +0200, Kai wrote:
On Wed, Apr 11, 2007 at 12:53:32PM +0200, Kai wrote:
 
 Hello all,
 
 We're running into regular panics on our webserver after upgrading
 from 4.x to 6.2-stable:

   
   Hi all,
   
   To continue this story, a colleague wrote a small program in C that launches
   40 threads to randomly append and write to 10 files on an NFS mounted
   filesystem. 
   
   If I keep removing the files on one of the other machines in a while loop,
   the first system panics:
   
   Fatal trap 12: page fault while in kernel mode
   cpuid = 1; apic id = 01
   fault virtual address   = 0x34
   fault code  = supervisor read, page not present
   instruction pointer = 0x20:0xc06bdefa
   stack pointer   = 0x28:0xeb9f69b8
   frame pointer   = 0x28:0xeb9f69c4
   code segment= base 0x0, limit 0xf, type 0x1b
   = DPL 0, pres 1, def32 1, gran 1
   processor eflags= interrupt enabled, resume, IOPL = 0
   current process = 73626 (nfscrash)
   trap number = 12
   panic: page fault
   cpuid = 1
   Uptime: 3h2m14s
   
   Sounds like a nice denial of service problem. I can hand the program to
   developers on request.
  
  Please send it to me.  Panics are always much easier to get fixed if
  they come with a test case that a developer can use to reproduce it.
  
  Kris

I have been working on this problem all weekend, and at this point I have a
strong hunch that it is a result of revision 1.424 of sys/kern/vfs_bio.c,
which was committed between FreeBSD 5.1 and 5.2.  This hunch is currently
being tested on a system that was cvsupped to the code just before 1.424;
it has been running about 7 times longer than the usual time required to
crash.  I am currently attempting to craft a patch for 6.2 that essentially
backs out the change to see if that works, but if this information can help
send a FreeBSD developer down the right trail to a proper fix, great.  I
will follow up with more detailed findings and results tonight or soon.

links:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.423;r2=1.424
related to 1.424:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.420r2=1.421

Commit emails:
http://docs.freebsd.org/cgi/mid.cgi?200311150845.hAF8jawU027349
http://docs.freebsd.org/cgi/mid.cgi?20030445.hAB4jbYw093253
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]