I finally figured out that the problem is on my side. My branch has a fix
for https://www.illumos.org/issues/7395; when I merged the master branch,
the upstream changes stepped on my fix and caused the boot to hang. Sorry
for the fuss, and thanks a lot for your attention and kindness.


On Mon, Aug 13, 2018 at 1:43 PM, Jason King <[email protected]>
wrote:

> I also built master from this morning
> (745f0bf63f3e20e5c1f0b83d85eaa4e99efcf441)
> and so far it has been ok.
>
> If you do need to do kmdb on nmi, the easiest way I’ve found is something
> similar to this diff (in smartos-live):
>
> diff --git a/overlay/generic/etc/system b/overlay/generic/etc/system
> index 62b96ef..2fded09 100644
> --- a/overlay/generic/etc/system
> +++ b/overlay/generic/etc/system
> @@ -113,8 +113,8 @@ set ip:ip_squeue_fanout=1
>  *
>  * Machines should take a crash dump and reboot when receiving an NMI
>  *
> -set pcplusmp:apic_panic_on_nmi=1
> -set apix:apic_panic_on_nmi=1
> +set pcplusmp:apic_kmdb_on_nmi=1
> +set apix:apic_kmdb_on_nmi=1
>
>  *
>  * Don't use multi-threaded fast crash dump or a high compression level
>
>
> You can boot -k and do some kmdb tricks to set a breakpoint at the right
> spot and set those in kmdb (you need to set a breakpoint some point after
> the modules are loaded, but before it hits the hang), but I found it was a
> bigger hassle to get it right than just producing a special-purpose image
> (I just keep a branch with just the above change I can rebase/checkout/etc
> as needed) — though maybe those with better mdb-fu than me have better luck
> with that approach.
>
>
> From: Youzhong Yang <[email protected]> <[email protected]>
> Reply: Youzhong Yang <[email protected]> <[email protected]>
> Date: August 13, 2018 at 9:29:07 AM
>
> To: Jason King <[email protected]> <[email protected]>
> Cc: [email protected] <[email protected]>
> <[email protected]>
> Subject:  Re: [smartos-discuss] still hang at boot - OS-7079
> mp_startup_common races itself
>
> Hi Jason,
>
> The image that had the issue was a full build, but last night I played
> with git reset --hard, git pull, etc., and the image I got didn't reproduce
> the hang. So now I am starting all over again to see if I can reproduce it.
>
> Thanks.
>
> On Mon, Aug 13, 2018 at 10:20 AM, Jason King <[email protected]>
> wrote:
>
>> That is strange — the same git repo that built that image was the one
>> that was pushed to Gerrit and merged with master (i.e. it is the _same_
>> commit).  Looking at the current master (as of a few minutes ago), the
>> change doesn’t appear to be stepped on (it’s a very small change — it just
>> moves the setting of a bit mask indicating the CPU has finished starting up
>> to the last thing in the per-cpu startup thread (minus some diagnostic
>> messages after startup and a call to thread_exit())).
>>
>> Did you try doing a 'gmake clobber' in your SmartOS repo before
>> building?  Unfortunately, incremental builds (even just rebuilding
>> illumos-joyent) don’t always work and can sometimes cause strange behavior.
>>
>> In the meantime, I’ll try doing a full build of the current master and
>> installing it on my server here at home — it was very good at tripping the
>> bug in OS-7079, so I’ll see if I can get it to hang (though it’ll take a
>> bit to do a full build).
>>
>>
>> From: Youzhong Yang <[email protected]> <[email protected]>
>> Reply: Youzhong Yang <[email protected]> <[email protected]>
>> Date: August 13, 2018 at 12:27:52 AM
>>
>> To: Jason King <[email protected]> <[email protected]>
>> Cc: [email protected] <[email protected]>
>> <[email protected]>
>> Subject:  Re: [smartos-discuss] still hang at boot - OS-7079
>> mp_startup_common races itself
>>
>> So your image booted up. Interesting ... Maybe something else messed up
>> your fix?
>> Anyway, I am now building my image and will see what I can get from
>> ::cpustack.
>>
>> On Mon, Aug 13, 2018 at 1:14 AM, Jason King <[email protected]>
>> wrote:
>>
>>> Doh.. the problems of it being late :) .. there should be a ‘public’ in
>>> there.
>>>
>>> Try
>>>
>>> https://us-east.manta.joyent.com/jbk/public/OS-7079/platform-20180719T001516Z.iso
>>>
>>>
>>> From: Youzhong Yang <[email protected]> <[email protected]>
>>> Reply: Youzhong Yang <[email protected]> <[email protected]>
>>> Date: August 13, 2018 at 12:12:52 AM
>>>
>>> To: Jason King <[email protected]> <[email protected]>
>>> Cc: [email protected] <[email protected]
>>> .org> <[email protected]>
>>> Subject:  Re: [smartos-discuss] still hang at boot - OS-7079
>>> mp_startup_common races itself
>>>
>>> I got this:
>>>
>>> {"code":"ResourceNotFound","message":"/jbk/OS-7079/platform-20180719T001516Z.iso
>>> does not exist"}
>>>
>>> In our /etc/system, I have
>>> set pcplusmp:apic_panic_on_nmi=1
>>> set apix:apic_panic_on_nmi=1
>>>
>>> If I set them to 0, and boot with -k, an NMI should drop into kmdb,
>>> right? I will build an image now and test.
>>>
>>>
>>> On Mon, Aug 13, 2018 at 1:04 AM, Jason King <[email protected]>
>>> wrote:
>>>
>>>> There’s a couple of ways — you can boot -kd and set a breakpoint to set
>>>> it.  You can also set it in etc/system in the proto area when building an
>>>> image.
>>>>
>>>> If you want, I do have an image of 20180719 w/ OS-7079 applied and kmdb
>>>> on NMI already set (you’d still want to boot -k)  — you can grab it at
>>>> https://us-east.manta.joyent.com/jbk/OS-7079/platform-20180719T001516Z.{iso,tgz,usb.bz2}
>>>>
>>>> If you do, it’d be interesting to see what ::cpustack on each core
>>>> looks like.
>>>>
>>>>
>>>> From: Youzhong Yang <[email protected]> <[email protected]>
>>>> Reply: Youzhong Yang <[email protected]> <[email protected]>
>>>> Date: August 12, 2018 at 11:58:48 PM
>>>> To: Jason King <[email protected]>
>>>> <[email protected]>
>>>> Cc: [email protected] <[email protected]
>>>> .org> <[email protected]>
>>>> Subject:  Re: [smartos-discuss] still hang at boot - OS-7079
>>>> mp_startup_common races itself
>>>>
>>>> I sent an NMI, but it printed out a stack trace plus a message "no dump
>>>> device" or something, then rebooted. I tried -v on my old Supermicro
>>>> system; on the console I saw messages about sd## devices, then it hung.
>>>> The console still responded to the keyboard, but just stayed that way
>>>> forever.
>>>>
>>>> What change is needed to drop into kmdb when the OS receives an NMI?
>>>>
>>>> On Mon, Aug 13, 2018 at 12:06 AM, Jason King <
>>>> [email protected]> wrote:
>>>>
>>>>> Was that with boot -v?  Are you able to send the system an NMI after
>>>>> it hangs (or get the boot -v output up to the hang)?
>>>>>
>>>>> Prior to OS-7079, the system would begin starting up the next CPU
>>>>> before it had completely finished initializing the ‘current’ CPU (which
>>>>> could deadlock depending on which CPU obtained a particular lock first).
>>>>> The change makes it wait until the current CPU is finished starting up
>>>>> before proceeding to the next CPU.
>>>>>
>>>>> It’s certainly possible it could have revealed another bug. The bug
>>>>> fixed by OS-7079 was itself introduced almost 10 years ago, but didn’t
>>>>> seem to be easy to trigger until recent CPUs.
>>>>>
>>>>>
>>>>> From: Youzhong Yang <[email protected]> <[email protected]>
>>>>> Reply: [email protected]
>>>>> <[email protected]>
>>>>> <[email protected]>
>>>>> Date: August 12, 2018 at 10:46:05 PM
>>>>> To: [email protected] <[email protected]
>>>>> .org> <[email protected]>
>>>>> Subject:  [smartos-discuss] still hang at boot - OS-7079
>>>>> mp_startup_common races itself
>>>>>
>>>>> Today I built a smartos image (with all git repos synced to master)
>>>>> and rebooted the host with that image. It hung after the banner message +
>>>>> one more line about power management or something.
>>>>>
>>>>> Then I reverted OS-7079, built an image, and rebooted; it worked
>>>>> perfectly.
>>>>>
>>>>> So does it mean OS-7079 fixed one issue, but caused another? My host
>>>>> is an old Supermicro X8DAH, Intel(R) Xeon(R) CPU X5570 @ 2.93GHz.
>>>>> Tomorrow I will try on a new all-NVMe system and see if it works.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>
>>>
>>
>



