On Fri, May 22, 2009 at 9:29 PM, Mike Gerdts <mger...@gmail.com> wrote:
> On Fri, May 22, 2009 at 6:52 PM, Jason King <ja...@ansipunx.net> wrote:
>> On Fri, May 22, 2009 at 5:46 PM, Mike Gerdts <mger...@gmail.com> wrote:
>>> I've encountered something that confuses me in a core file because dbx
>>> and mdb seem to be telling me different things.  According to dbx:
>>>
>>> $ dbx /.../httpd core
>>> ...
>>> t@3 (l@3) terminated by signal ILL (Illegal Instruction)
>>> 0xff046df0: __lwp_park+0x0014:  bcc,a,pt  %icc,__lwp_park+0x24  ! 0xff046e00
>>> ...
>>> (dbx) dis __lwp_park
>>> 0xff046ddc: __lwp_park       :  mov      %o1, %o2
>>> 0xff046de0: __lwp_park+0x0004:  mov      %o0, %o1
>>> 0xff046de4: __lwp_park+0x0008:  clr      %o0
>>> 0xff046de8: __lwp_park+0x000c:  mov      77, %g1
>>> 0xff046dec: __lwp_park+0x0010:  ta       8
>>> 0xff046df0: __lwp_park+0x0014:  bcc,a,pt  %icc,__lwp_park+0x24  ! 0xff046e00
>>> 0xff046df4: __lwp_park+0x0018:  clr      %o0
>>> 0xff046df8: __lwp_park+0x001c:  cmp      %o0, 91
>>> 0xff046dfc: __lwp_park+0x0020:  move     %icc,0x4, %o0
>>> 0xff046e00: __lwp_park+0x0024:  retl
>>>
>>> Notice that there is no indication that __lwp_park+0x14 has anything
>>> wrong with it.  This appears to be hand-coded assembly that hasn't
>>> changed for a long time.
>>>
>>> However, when I look at it with mdb, I see:
>>>
>>> # mdb core
>>> Loading modules: [ libc.so.1 libuutil.so.1 ld.so.1 ]
>>> > ::stack
>>> libc.so.1`__lwp_park+0x14(0, 174058, 0, 0, 7d89c, 0)
>>> libc.so.1`cond_wait_queue+0x4c(174098, 174058, 0, 0, 1c00, 0)
>>> libc.so.1`cond_wait+0x10(174098, 174058, 0, 1c00, 0, 1741b8)
>>> libc.so.1`pthread_cond_wait+8(174098, 174058, 0, 0, 174058, ff0402a0)
>>> libapr-0.so.0.9.4`apr_thread_cond_wait+0x44(174090, 174050, 2aa688, 240, 2aa780, 34e820)
>>> ap_queue_pop+0x78(174038, fe2fbf1c, fe2fbf10, 0, 34e820, ff0c5480)
>>> worker_thread+0x160(1742f8, 1a4858, 0, 0, feb40a00, 1)
>>> libapr-0.so.0.9.4`dummy_worker+0x48(1742f8, fe2fc000, 0, 0, ff2d0a58, 1)
>>> libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
>>> > ::status
>>> debugging core file of httpd (32-bit) from XXX
>>> file: /.../httpd
>>> initial argv: /.../httpd ...
>>> threading model: multi-threaded
>>> status: process terminated by SIGILL (Illegal Instruction)
>>> > libc.so.1`__lwp_park::dis
>>> libc.so.1`__lwp_park:           mov       %o1, %o2
>>> libc.so.1`__lwp_park+4:         mov       %o0, %o1
>>> libc.so.1`__lwp_park+8:         clr       %o0
>>> libc.so.1`__lwp_park+0xc:       mov       0x4d, %g1
>>> libc.so.1`__lwp_park+0x10:      ta        0x8
>>> libc.so.1`__lwp_park+0x14:      0x3a480004   <<< Notice that this instruction is not decoded!
>>> libc.so.1`__lwp_park+0x18:      clr       %o0
>>> libc.so.1`__lwp_park+0x1c:      cmp       %o0, 0x5b
>>> libc.so.1`__lwp_park+0x20:      0x91646004
>>> libc.so.1`__lwp_park+0x24:      retl
>>> libc.so.1`__lwp_park+0x28:      nop
>>>
>>> According to mdb, it looks to me like someone clobbered __lwp_park+0x14.
>>> However, pmap suggests this shouldn't be possible because the memory
>>> segment is readable and executable but not writable.
>>>
>>> $ pmap core
>>> ....
>>> FEF80000    1152K r-x--  /lib/libc.so.1
>>> ...
>>>
>>> From the dbx output at the beginning of this message, we can see
>>> that __lwp_park+0x14 is at address 0xff046df0, which is between
>>> FEF80000 and FF0A0000 (FF0A0000 = FEF80000 + 1152K).
>>>
>>> Why do mdb and dbx disagree on the contents of __lwp_park+0x14 (and
>>> +0x20)?  Is mdb reading the core and dbx reading the executable and
>>> libs from the file system?  Does modification to a read-only memory
>>> segment suggest a hardware error?
>>
>> That is what the disassembler prints when it is given an instruction
>> that is invalid for the current disassembly mode (this should match
>> the old disassembler's behavior). 0x3a480004 is the encoding of
>> 'bcc,a,pt +0x10' (a branch to __lwp_park+0x24), but that is a sparcv9
>> instruction, not a v8 one, and the disassembler is very strict about
>> how it decodes instructions :)
>>
>> Try ::dismode v9 and disassemble again.  It should agree -- I tested
>> using /usr/bin/dis on libc.so on a recent box (mdb uses the same code
>> to disassemble; I do not know what dbx uses).
>>
>
> That did the trick.
>
> However, I'm extremely confused as to why the process died with SIGILL
> when trying to execute this instruction.

It does sound interesting -- I'm not super familiar with the chips,
but the only way I can see it happening is if there were some strict
v8 mode on the chip (or worse, a hw bug!).  But AFAIK, all recent
SPARC chips (at least from Sun or Fujitsu) support v8+ even in 32-bit
mode, so the instruction should be valid.
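
The two words that mdb left undecoded can be picked apart by hand. Below is
a minimal Python sketch, assuming the SPARC V9 field layouts for BPcc
(branch with prediction) and the immediate form of MOVcc. The function
names are hypothetical and this is not how mdb or dis decode instructions;
it only splits the raw words into their bit fields.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_bpcc(word):
    """Split a 32-bit word into the V9 BPcc fields (assumed layout)."""
    return {
        "op": (word >> 30) & 0x3,      # 0 = branch/sethi format
        "annul": (word >> 29) & 0x1,   # 1 = ",a" suffix
        "cond": (word >> 25) & 0xF,    # 0b1101 = carry clear (bcc)
        "op2": (word >> 22) & 0x7,     # 0b001 = BPcc (V9), 0b010 = Bicc (V8)
        "cc": (word >> 20) & 0x3,      # 0b00 = %icc, 0b10 = %xcc
        "pt": (word >> 19) & 0x1,      # 1 = predict taken
        "byte_offset": sign_extend(word & 0x7FFFF, 19) * 4,
    }


def decode_movcc_imm(word):
    """Split a 32-bit word into the V9 MOVcc (immediate) fields (assumed layout)."""
    return {
        "op": (word >> 30) & 0x3,      # 2 = arithmetic format
        "rd": (word >> 25) & 0x1F,     # 8 = %o0
        "op3": (word >> 19) & 0x3F,    # 0x2c = MOVcc
        "cc2": (word >> 18) & 0x1,     # cc2=1 with cc=0b00 selects %icc
        "cond": (word >> 14) & 0xF,    # 0b0001 = equal
        "i": (word >> 13) & 0x1,       # 1 = immediate operand
        "cc": (word >> 11) & 0x3,
        "simm11": sign_extend(word & 0x7FF, 11),
    }


bp = decode_bpcc(0x3A480004)        # word at __lwp_park+0x14
mov = decode_movcc_imm(0x91646004)  # word at __lwp_park+0x20
branch_target = 0x14 + bp["byte_offset"]

print(bp)
print(mov)
print(hex(branch_target))
```

With these layouts the first word has op2 = 1, which is the V9 BPcc encoding
rather than the V8 Bicc one (op2 = 2), and the branch target works out to
__lwp_park+0x24, matching the dbx disassembly. The second word decodes as a
MOVcc on %icc with immediate 4 into %o0, which is also a V9-only instruction.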

Can you reproduce the crash consistently (not that I can help much,
but I'm curious myself)?
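
The address arithmetic from the pmap output quoted above can be
double-checked with a short Python sketch. The helper name parse_size is
hypothetical, and the values are copied from the dbx and pmap output in
this thread.

```python
def parse_size(text):
    """Convert a pmap-style size such as '1152K' to bytes."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    return int(text[:-1]) * units[text[-1].upper()]


seg_base = 0xFEF80000                      # libc.so.1 r-x segment from pmap
seg_end = seg_base + parse_size("1152K")   # first address past the segment
fault_pc = 0xFF046DF0                      # __lwp_park+0x14 from dbx

in_segment = seg_base <= fault_pc < seg_end
print(hex(seg_end), in_segment)
```

This agrees with the quoted figures: the segment ends at FF0A0000 and the
faulting address lies inside the read-only, executable libc mapping.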
_______________________________________________
tools-discuss mailing list
tools-discuss@opensolaris.org
