Re: 5.1, Data Corruption, Intel, Oh my! [patch] - Fatal trap 12

2003-08-14 Thread Terry Lambert
Bosko Milekic wrote:
  db trace
  _mtx_lock_flags(0,0,c07aa287,11e,c0c21aaa) at _mtx_lock_flags+0x43
  vm_fault(c102f000,c000,2,0,c08205c0) at vm_fault+0x2b4
  trap_pfault(c0c21b9e,0,c4d8,10,c4d8) at trap_pfault+0x152
  trap(6c200018,10,1bc40060,1c,0) at trap+0x30d
  calltrap() at calltrap+0x5
  --- trap 0xc, eip = 0x5949, esp = 0xc0c21bde, dbp = 0xc0c21be4 ---
  (null)(1bf80058,0,530e0102,80202,505a61) at 0x5949
  db

FWIW: This looks like a call through a NULL function pointer, one
that was never initialized or has been explicitly NULL'ed out.
Decoding the pointer values to find out what the objects are
would probably go a long way toward knowing what's going on.
Last time I saw one of these, it was the NFS lease function.  He
might also want to look for any function pointer that takes 5
arguments; Linux threads is a likely suspect, in that the thread
mailboxes are at a fixed location, so he should make sure to
recompile any kernel modules when he compiles his new kernel.

BTW: Good work on the patch, both you and Peter!

-- Terry
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: 5.1, Data Corruption, Intel, Oh my! [patch] - Fatal trap 12

2003-08-14 Thread Terry Lambert
Peter Edwards wrote:
  ... He might also want to look for any function pointer
  that takes 5 arguments;
 
 Nice tactic, but misleading in this case, methinks.
 
 I assume you're basing this on the 5 arguments shown in the backtrace.
 The 5 arguments passed to the function at 0x5949 are probably just
 defaulted; I doubt they have any significance.
 
 Long version:
 
 ddb tries to work out the number of arguments passed to a function at a
 particular stack frame first based on symbolic information for the
 function itself (obviously not an option here), then based on the
 instruction at the return address in that frame. This works at best
 sporadically in the face of -O compiled C code. The fact that there's no
 function under the (null) would strongly suggest that ddb got confused
 with the frame pointer here and didn't get any useful information with
 which to work out the argument count.

I don't know how accurate this assumption is.  I don't think
DDB is confused, because the NULL is consistent with the reported
fault address.  Even if we assume that it's confused, the PC is
enough information to locate the function pointer dereference that
is occurring.  I also have to assume that the function pointer is
in scope, since it's able to call through it to fault the kernel.


 In the face of failure, ddb just wildly prints out the 5 words under the
 stack pointer.

I did suggest that the correct thing to do would be to decode
what those words were pointing at, and thereby what types the
arguments were...


 Given that there's no real function at 0x5949, the stack frame won't
 have been set up at all, the frame pointer is still pointing to the
 caller's frame, which could be foobar anyway.

The stack frame is set up, since you don't run at all without
a stack, period.  The stack may be corrupt, in this case, but
that's an incredibly rare failure mode recently, and mostly
this still looks like a NULL pointer dereference to me.


 What can be useful is to print out the values on the stack symbolically.
 (in gdb,  p/a ((void **)$sp)[EMAIL PROTECTED] I'm sure ddb can do something
 similar, but no idea how...). And hope to find the caller's return
 address lying in the output.

The best way would be to take a system dump, and then use GDB.

It turns out that, for the most part, you can rebuild a kernel
with symbols even if you didn't keep a debug kernel, and the
addresses will resolve to names that are close to the right
ones; hopefully, though, there's a kernel.debug lying around
for this thing.

In general, we'd be seeing people reporting this all over the
place, loudly, if it wasn't a custom kernel in the first place,
so I'd probably start there.

-- Terry


Re: 5.1, Data Corruption, Intel, Oh my! [patch] - Fatal trap 12

2003-08-14 Thread Peter Edwards
On Wed, 2003-08-13 at 07:38, Terry Lambert wrote:
 Peter Edwards wrote:
   ... He might also want to look for any function pointer
   that takes 5 arguments;
  
  Nice tactic, but misleading in this case, methinks.
  
  I assume you're basing this on the 5 arguments shown in the backtrace.
  The 5 arguments passed to the function at 0x5949 are probably just
  defaulted; I doubt they have any significance.
  
  Long version:
  
  ddb tries to work out the number of arguments passed to a function at a
  particular stack frame first based on symbolic information for the
  function itself (obviously not an option here), then based on the
  instruction at the return address in that frame. This works at best
  sporadically in the face of -O compiled C code. The fact that there's no
  function under the (null) would strongly suggest that ddb got confused
  with the frame pointer here and didn't get any useful information with
  which to work out the argument count.
 
 I don't know how accurate this assumption is.  I don't think
 DDB is confused, because the NULL is consistent with the reported
 fault address.  Even if we assume that it's confused, the PC is
 enough information to locate the function pointer dereference that
 is occurring.  I also have to assume that the function pointer is
 in scope, since it's able to call through it to fault the kernel.

  In the face of failure, ddb just wildly prints out the 5 words under the
  stack pointer.
 
 I did suggest that the correct thing to do would be to decode
 what those words were pointing at, and thereby what types the
 arguments were...

My main point was really just a comment on your original statement
that "He might also want to look for any function pointer that takes 5
arguments".

I was assuming that you were suggesting this based on the fact that the
stack frame containing the (null) indicated 5 arguments passed to the
function at 0x5949. DDB has no symbol for this address (it's certainly
not a function) and does not know where it returns to (there is no
function below it on the stack). DDB has no other way of working out how
many arguments were passed in a particular stack frame. As a result, it
is merely showing the first 5 _possible_ argument values to the
function. Agreed?




Re: 5.1, Data Corruption, Intel, Oh my! [patch] - Fatal trap 12

2003-08-14 Thread Bosko Milekic

On Mon, Aug 11, 2003 at 11:12:36AM -0500, Mark Johnston wrote:
...
 Using $PIR table, 14 entries at 0xc00fdeb0
 apm0: APM BIOS on motherboard
 kernel trap 12 with interrupts disabled
 
 
 Fatal trap 12: page fault while in kernel mode
 fault virtual address   = 0x0
 fault code              = supervisor read, page not present
 instruction pointer     = 0x8:0xc059a773
 stack pointer           = 0x10:0xc0c219e2
 frame pointer           = 0x10:0xc0c21a02
 code segment            = base 0x0, limit 0xf, type 0x1b
                         = DPL 0, pres 1, def32 1, gran 1
 processor eflags        = interrupt enabled, resume, IOPL = 0
 current process         = 0 (swapper)
 kernel: type 12 trap, code=0
 Stopped at      _mtx_lock_flags+0x43:   cmpl    $0xc07edf8c,0(%ebx)
 db trace
 _mtx_lock_flags(0,0,c07aa287,11e,c0c21aaa) at _mtx_lock_flags+0x43
 vm_fault(c102f000,c000,2,0,c08205c0) at vm_fault+0x2b4
 trap_pfault(c0c21b9e,0,c4d8,10,c4d8) at trap_pfault+0x152
 trap(6c200018,10,1bc40060,1c,0) at trap+0x30d
 calltrap() at calltrap+0x5
 --- trap 0xc, eip = 0x5949, esp = 0xc0c21bde, dbp = 0xc0c21be4 ---
 (null)(1bf80058,0,530e0102,80202,505a61) at 0x5949
 db
 
 The old kernel (Friday's CURRENT plus ATAng) still boots and works fine.
 Please ask for any more information you need or any other steps I can
 take.

 Mark

  Please try without any other patches but just stock -current.  The
  above fault looks like a NULL pointer dereference somewhere.

  Once you sup to recent _stock_ -current (and no other patches), apply
  the patch and then do a 'find /usr/src/sys/ -name "*.rej" -print' and
  see if any patch hunks were rejected before you build and boot the
  kernel.
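
For anyone who would rather script the reject check, here is a minimal
Python sketch of the same search. The function name find_rejects is
made up for illustration; the only assumption about patch(1) is that it
leaves rejected hunks in files ending in .rej.

```python
import os


def find_rejects(root):
    """Return the sorted paths of all *.rej files under root."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # patch(1) writes hunks it could not apply to <file>.rej
            if name.endswith(".rej"):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)


if __name__ == "__main__":
    # /usr/src/sys is the source tree named in the message above.
    for path in find_rejects("/usr/src/sys"):
        print(path)
```

An empty result means every hunk applied; any path printed is a file
whose patch needs to be looked at by hand before building.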

  If after all this you still crash like in the above, then please send
  me [off-list] your kernel configuration file as well as a trace (as
  above).

-- 
Bosko Milekic  *  [EMAIL PROTECTED]  *  [EMAIL PROTECTED]
TECHNOkRATIS Consulting Services  *  http://www.technokratis.com/


Re: 5.1, Data Corruption, Intel, Oh my! [patch] - Fatal trap 12

2003-08-14 Thread Peter Edwards
On Tue, 2003-08-12 at 12:52, Terry Lambert wrote:
 Bosko Milekic wrote:
   db trace
   _mtx_lock_flags(0,0,c07aa287,11e,c0c21aaa) at _mtx_lock_flags+0x43
   vm_fault(c102f000,c000,2,0,c08205c0) at vm_fault+0x2b4
   trap_pfault(c0c21b9e,0,c4d8,10,c4d8) at trap_pfault+0x152
   trap(6c200018,10,1bc40060,1c,0) at trap+0x30d
   calltrap() at calltrap+0x5
   --- trap 0xc, eip = 0x5949, esp = 0xc0c21bde, dbp = 0xc0c21be4 ---
   (null)(1bf80058,0,530e0102,80202,505a61) at 0x5949
   db
 
 ... He might also want to look for any function pointer
 that takes 5 arguments; 

Nice tactic, but misleading in this case, methinks.

I assume you're basing this on the 5 arguments shown in the backtrace.
The 5 arguments passed to the function at 0x5949 are probably just
defaulted; I doubt they have any significance.

Long version:

ddb tries to work out the number of arguments passed to a function at a
particular stack frame first based on symbolic information for the
function itself (obviously not an option here), then based on the
instruction at the return address in that frame. This works at best
sporadically in the face of -O compiled C code. The fact that there's no
function under the (null) would strongly suggest that ddb got confused
with the frame pointer here and didn't get any useful information with
which to work out the argument count.

In the face of failure, ddb just wildly prints out the 5 words under the
stack pointer.
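
To make the heuristic above concrete, here is a rough Python sketch of
that kind of guess. It is not ddb's source. It assumes the usual i386
convention where the caller removes its arguments right after the call,
either with "addl $N,%esp" or, for a single argument, "popl %ecx". The
name guess_numargs, the text_start/text_end bounds, and the opcode
checks are all illustrative assumptions.

```python
DEFAULT_ARGS = 5  # assumed fallback when nothing can be inferred


def guess_numargs(ret_addr, code, text_start, text_end):
    """Guess how many 32-bit arguments a caller pushed.

    ret_addr   -- return address taken from the stack frame
    code       -- bytes of the text segment, beginning at text_start
    text_start -- lowest address covered by code
    text_end   -- one past the highest address covered by code
    """
    # An address outside the text segment gives us nothing to decode.
    if not (text_start <= ret_addr < text_end):
        return DEFAULT_ARGS
    off = ret_addr - text_start
    insn = code[off:off + 3]
    # popl %ecx: the caller throws away exactly one argument.
    if insn[:1] == b"\x59":
        return 1
    # addl $imm8,%esp: the caller throws away imm8 bytes of arguments.
    if len(insn) == 3 and insn[:2] == b"\x83\xc4":
        return insn[2] // 4
    # Anything else (common with -O): fall back to the default.
    return DEFAULT_ARGS


if __name__ == "__main__":
    # Hypothetical text: addl $12,%esp ; popl %ecx ; nop
    text = b"\x83\xc4\x0c" + b"\x59" + b"\x90"
    base = 0xC0100000
    print(guess_numargs(base, text, base, base + len(text)))      # 3
    print(guess_numargs(base + 3, text, base, base + len(text)))  # 1
    print(guess_numargs(0x5949, text, base, base + len(text)))    # 5
```

The last call shows the case in this trace: with no usable return
address, the count comes out as the default and says nothing about the
real function signature.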

Given that there's no real function at 0x5949, the stack frame won't
have been set up at all, the frame pointer is still pointing to the
caller's frame, which could be foobar anyway.

What can be useful is to print out the values on the stack symbolically.
(in gdb,  p/a ((void **)$sp)[EMAIL PROTECTED] I'm sure ddb can do something
similar, but no idea how...). And hope to find the caller's return
address lying in the output.
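
The idea of reading the stack symbolically can be sketched in a few
lines of Python. This is only an illustration of what a debugger does
when it prints a word as symbol+offset. The helper names symbolize and
scan_stack are invented, and apart from _mtx_lock_flags (whose start
address follows from the trace: 0xc059a773 - 0x43) the symbol table and
text bounds are hypothetical.

```python
import bisect


def symbolize(addr, symbols):
    """Return 'name+0xoff' for the nearest symbol at or below addr.

    symbols -- list of (start_address, name) tuples sorted by address.
    Returns None when addr lies below every known symbol.
    """
    starts = [start for start, _name in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return f"{name}+{addr - start:#x}"


def scan_stack(words, symbols, text_start, text_end):
    """Yield (index, word, label) for stack words that point into text."""
    for i, word in enumerate(words):
        # Only words inside the text segment can be return addresses.
        if text_start <= word < text_end:
            yield i, word, symbolize(word, symbols)


if __name__ == "__main__":
    symbols = [
        (0xC059A730, "_mtx_lock_flags"),  # derived from the trace
        (0xC0600000, "some_other_func"),  # hypothetical entry
    ]
    # Hypothetical stack contents; only the last word looks like code.
    stack = [0x1BF80058, 0x0, 0x530E0102, 0xC059A773]
    for idx, word, label in scan_stack(stack, symbols, 0xC0100000, 0xC0800000):
        print(idx, hex(word), label)
```

Words that resolve to a plausible function+offset are the candidates
for the caller's return address; everything else is data or junk.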

HTH,
Peter.





Re: 5.1, Data Corruption, Intel, Oh my! [patch] - Fatal trap 12

2003-08-14 Thread Mark Johnston
Bosko Milekic wrote:
 The information I'd like to see, if possible (private mail is OK):
 
 1) Your hardware, UP or SMP.

UP - the hardware is an IBM Thinkpad A31 with a Pentium 4 M chip.
ACPI is disabled.

 2) Whether you have PAE turned on.

PAE is not enabled.

 3) If you had the data corruption problem, does the patch solve it?
Please make sure you do NOT include the DISABLE_PSE and
DISABLE_PG_G options when you test for this, as they should no
longer be needed.

I didn't notice data corruption, but I haven't heavily stressed this
laptop at all.

I applied this patch to Friday's CURRENT, stock except for Soeren's
ATAng patch.  The result is a crash very early in the boot.  I've
transcribed the message below as accurately as possible, but there may
be typos.  Good dmesg/kernel config files can be had at
http://www.skyweb.ca/~mark/.

Using $PIR table, 14 entries at 0xc00fdeb0
apm0: APM BIOS on motherboard
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x0
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xc059a773
stack pointer           = 0x10:0xc0c219e2
frame pointer           = 0x10:0xc0c21a02
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (swapper)
kernel: type 12 trap, code=0
Stopped at      _mtx_lock_flags+0x43:   cmpl    $0xc07edf8c,0(%ebx)
db trace
_mtx_lock_flags(0,0,c07aa287,11e,c0c21aaa) at _mtx_lock_flags+0x43
vm_fault(c102f000,c000,2,0,c08205c0) at vm_fault+0x2b4
trap_pfault(c0c21b9e,0,c4d8,10,c4d8) at trap_pfault+0x152
trap(6c200018,10,1bc40060,1c,0) at trap+0x30d
calltrap() at calltrap+0x5
--- trap 0xc, eip = 0x5949, esp = 0xc0c21bde, dbp = 0xc0c21be4 ---
(null)(1bf80058,0,530e0102,80202,505a61) at 0x5949
db

The old kernel (Friday's CURRENT plus ATAng) still boots and works fine.
Please ask for any more information you need or any other steps I can
take.

Mark
