I finally figured out that the problem was on my side. My branch has a fix for https://www.illumos.org/issues/7395; when I merged the master branch, the upstream changes stepped over my fix and caused the boot to hang. Sorry for the fuss, and thanks a lot for your attention and kindness.
On Mon, Aug 13, 2018 at 1:43 PM, Jason King <[email protected]> wrote:
> I also built master from this morning
> (745f0bf63f3e20e5c1f0b83d85eaa4e99efcf441)
> and so far it has been ok.
>
> If you do need to do kmdb on NMI, the easiest way I’ve found is something
> similar to this diff (in smartos-live):
>
> diff --git a/overlay/generic/etc/system b/overlay/generic/etc/system
> index 62b96ef..2fded09 100644
> --- a/overlay/generic/etc/system
> +++ b/overlay/generic/etc/system
> @@ -113,8 +113,8 @@ set ip:ip_squeue_fanout=1
>  *
>  * Machines should take a crash dump and reboot when receiving an NMI
>  *
> -set pcplusmp:apic_panic_on_nmi=1
> -set apix:apic_panic_on_nmi=1
> +set pcplusmp:apic_kmdb_on_nmi=1
> +set apix:apic_kmdb_on_nmi=1
>
>  *
>  * Don't use multi-threaded fast crash dump or a high compression level
>
>
> You can boot -k and do some kmdb tricks to set a breakpoint at the right
> spot and set those in kmdb (you need to set a breakpoint some point after
> the modules are loaded, but before it hits the hang), but I found it was a
> bigger hassle to get it right than just producing a special-purpose image
> (I just keep a branch with just the above change I can rebase/checkout/etc
> as needed), though maybe those with better mdb-fu than me have better luck
> with that approach.
>
>
> From: Youzhong Yang <[email protected]> <[email protected]>
> Reply: Youzhong Yang <[email protected]> <[email protected]>
> Date: August 13, 2018 at 9:29:07 AM
>
> To: Jason King <[email protected]> <[email protected]>
> Cc: [email protected] <[email protected]>
> <[email protected]>
> Subject: Re: [smartos-discuss] still hang at boot - OS-7079
> mp_startup_common races itself
>
> Hi Jason,
>
> The image that had the issue was a full build, but last night I ran git
> reset --hard, git pull, etc., and the image I got didn't reproduce the hang.
> So now I am starting all over again to see if I can reproduce it.
>
> Thanks.
>
> On Mon, Aug 13, 2018 at 10:20 AM, Jason King <[email protected]>
> wrote:
>
>> That is strange - the same git repo that built that image was the one
>> that was pushed to Gerrit and merged with master (i.e. it is the _same_
>> commit). Looking at the current master (as of a few minutes ago), the
>> change doesn’t appear to be stepped on (it’s a very small change - it just
>> moves the setting of a bit mask indicating the CPU has finished starting up
>> to the last thing in the per-cpu startup thread (minus some diagnostic
>> messages after startup and a call to thread_exit())).
>>
>> Did you try doing a 'gmake clobber' in your SmartOS repo before
>> building? Unfortunately, incremental builds (even just rebuilding
>> illumos-joyent) don’t always work and can sometimes cause strange behavior.
>>
>> In the meantime, I’ll try doing a full build of the current master and
>> installing it on my server here at home - it was very good at tripping the
>> bug in OS-7079, so I’ll see if I can get it to hang (though it’ll take a
>> bit to do a full build).
>>
>>
>> From: Youzhong Yang <[email protected]> <[email protected]>
>> Reply: Youzhong Yang <[email protected]> <[email protected]>
>> Date: August 13, 2018 at 12:27:52 AM
>>
>> To: Jason King <[email protected]> <[email protected]>
>> Cc: [email protected] <[email protected]>
>> <[email protected]>
>> Subject: Re: [smartos-discuss] still hang at boot - OS-7079
>> mp_startup_common races itself
>>
>> So your image booted up. Interesting ... Maybe something else messed up
>> your fix?
>> Anyway, I am now building my image to see what I can get from ::cpustack.
>>
>> On Mon, Aug 13, 2018 at 1:14 AM, Jason King <[email protected]>
>> wrote:
>>
>>> Doh.. the problems of it being late :) .. there should be a ‘public’ in
>>> there.
>>>
>>> Try
>>>
>>> https://us-east.manta.joyent.com/jbk/public/OS-7079/platform-20180719T001516Z.iso
>>>
>>>
>>> From: Youzhong Yang <[email protected]> <[email protected]>
>>> Reply: Youzhong Yang <[email protected]> <[email protected]>
>>> Date: August 13, 2018 at 12:12:52 AM
>>>
>>> To: Jason King <[email protected]> <[email protected]>
>>> Cc: [email protected] <[email protected]>
>>> <[email protected]>
>>> Subject: Re: [smartos-discuss] still hang at boot - OS-7079
>>> mp_startup_common races itself
>>>
>>> I got this:
>>>
>>> {"code":"ResourceNotFound","message":"/jbk/OS-7079/platform-20180719T001516Z.iso does not exist"}
>>>
>>> In our /etc/system, I have
>>> set pcplusmp:apic_panic_on_nmi=1
>>> set apix:apic_panic_on_nmi=1
>>>
>>> If I set them to 0, and boot with -k, an NMI should drop into kmdb,
>>> right? I will build an image now and test.
>>>
>>>
>>> On Mon, Aug 13, 2018 at 1:04 AM, Jason King <[email protected]>
>>> wrote:
>>>
>>>> There are a couple of ways: you can boot -kd and set a breakpoint to set
>>>> it. You can also set it in etc/system in the proto area when building an
>>>> image.
>>>>
>>>> If you want, I do have an image of 20180719 w/ OS-7079 applied and kmdb
>>>> on NMI already set (you’d still want to boot -k). You can grab it at
>>>> https://us-east.manta.joyent.com/jbk/OS-7079/platform-20180719T001516Z.{iso,tgz,usb.bz2}
>>>>
>>>> If you do, it’d be interesting to see what ::cpustack on each core looks
>>>> like.
>>>>
>>>>
>>>> From: Youzhong Yang <[email protected]> <[email protected]>
>>>> Reply: Youzhong Yang <[email protected]> <[email protected]>
>>>> Date: August 12, 2018 at 11:58:48 PM
>>>> To: Jason King <[email protected]>
>>>> <[email protected]>
>>>> Cc: [email protected] <[email protected]>
>>>> <[email protected]>
>>>> Subject: Re: [smartos-discuss] still hang at boot - OS-7079
>>>> mp_startup_common races itself
>>>>
>>>> I sent an NMI, but it printed out a stack trace plus a message "no dump
>>>> device" or something, then rebooted. I tried -v on my old Supermicro
>>>> system; on the console I saw messages about sd## devices, then it hung.
>>>> The console still responded to the keyboard, but just stayed that way
>>>> forever.
>>>>
>>>> What change is needed to drop into kmdb when the OS receives an NMI?
>>>>
>>>> On Mon, Aug 13, 2018 at 12:06 AM, Jason King <[email protected]> wrote:
>>>>
>>>>> Was that with boot -v? Are you able to send the system an NMI after
>>>>> it hangs (or get the boot -v output up to the hang)?
>>>>>
>>>>> Prior to OS-7079, the system would begin starting up the next CPU
>>>>> before it had completely finished initializing the ‘current’ CPU (which
>>>>> could deadlock depending on which CPU obtained a particular lock first);
>>>>> the change makes it wait until the current CPU is finished starting up
>>>>> before proceeding to the next CPU.
>>>>>
>>>>> It’s certainly possible it could have revealed another bug - OS-7079
>>>>> itself was introduced almost 10 years ago, but didn’t seem to be easy to
>>>>> trigger until recent CPUs.
>>>>>
>>>>>
>>>>> From: Youzhong Yang <[email protected]> <[email protected]>
>>>>> Reply: [email protected]
>>>>> <[email protected]>
>>>>> <[email protected]>
>>>>> Date: August 12, 2018 at 10:46:05 PM
>>>>> To: [email protected] <[email protected]>
>>>>> <[email protected]>
>>>>> Subject: [smartos-discuss] still hang at boot - OS-7079
>>>>> mp_startup_common races itself
>>>>>
>>>>> Today I built a smartos image (with all git repos synced to master)
>>>>> and rebooted the host with that image. It hung after the banner message +
>>>>> one more line about power management or something.
>>>>>
>>>>> Then I reverted OS-7079, built an image, rebooted, and it worked
>>>>> perfectly.
>>>>>
>>>>> So does it mean OS-7079 fixed one issue, but caused another? My host
>>>>> is an old Supermicro X8DAH, Intel(R) Xeon(R) CPU X5570 @ 2.93GHz.
>>>>> Tomorrow I will try on a new all NVMe system and see if it works.
>>>>>
>>>>> Thanks.
>>>>> *smartos-discuss* | Archives
>>>>> <https://www.listbox.com/member/archive/184463/=now> | Modify
>>>>> <https://www.listbox.com/member/?> Your Subscription
>>>>> <https://www.listbox.com>
>>>>>
>>>>>
>>>>
>>>
>>
>
