Hi Jason,

The image that had the issue was a full build, but last night I ran git reset --hard, git pull, etc., and the image I got afterwards didn't reproduce the hang. So now I am starting all over again to see if I can reproduce it.
Thanks.

On Mon, Aug 13, 2018 at 10:20 AM, Jason King <[email protected]> wrote:

> That is strange — the same git repo that built that image was the one that
> was pushed to Gerrit and merged with master (i.e. it is the _same_
> commit). Looking at the current master (as of a few minutes ago), the
> change doesn’t appear to be stepped on (it’s a very small change — it just
> moves the setting of a bit mask indicating the CPU has finished starting up
> to the last thing in the per-cpu startup thread (minus some diagnostic
> messages after startup and a call to thread_exit()).
>
> Did you try doing a 'gmake clobber' in your SmartOS repo before building?
> Unfortunately, incremental (even just rebuilding illumos-joyent) doesn’t
> always work and can sometimes cause strange behavior.
>
> In the meantime, I’ll try doing a full build of the current master and
> installing it on my server here at home — it was very good at tripping the
> bug in OS-7079, so I’ll see if I can get it to hang (though it’ll take a
> bit to do a full build).
>
>
> From: Youzhong Yang <[email protected]> <[email protected]>
> Reply: Youzhong Yang <[email protected]> <[email protected]>
> Date: August 13, 2018 at 12:27:52 AM
> To: Jason King <[email protected]> <[email protected]>
> Cc: [email protected] <[email protected]> <[email protected]>
> Subject: Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself
>
> So your image booted up. Interesting ... Maybe something else messed up
> your fix?
> Anyway I am now building my image and see what I can get from ::cpustack.
>
> On Mon, Aug 13, 2018 at 1:14 AM, Jason King <[email protected]> wrote:
>
>> Doh.. the problems of it being late :) .. there should be a ‘public’ in
>> there.
>>
>> Try
>>
>> https://us-east.manta.joyent.com/jbk/public/OS-7079/platform-20180719T001516Z.iso
>>
>>
>> From: Youzhong Yang <[email protected]> <[email protected]>
>> Reply: Youzhong Yang <[email protected]> <[email protected]>
>> Date: August 13, 2018 at 12:12:52 AM
>> To: Jason King <[email protected]> <[email protected]>
>> Cc: [email protected] <[email protected]> <[email protected]>
>> Subject: Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself
>>
>> I got this:
>>
>> {"code":"ResourceNotFound","message":"/jbk/OS-7079/platform-20180719T001516Z.iso does not exist"}
>>
>> In our /etc/system, I have
>> set pcplusmp:apic_panic_on_nmi=1
>> set apix:apic_panic_on_nmi=1
>>
>> If I set them to 0, and boot with -k, a NMI should drop into kmdb, right?
>> I will build an image now and test.
>>
>>
>> On Mon, Aug 13, 2018 at 1:04 AM, Jason King <[email protected]> wrote:
>>
>>> There’s a couple of ways — you can boot -kd and set a breakpoint to set
>>> it. You can also set it in etc/system in the proto area when building an
>>> image.
>>>
>>> If you want, I do have an image of 20180719 w/ OS-7079 applied and kmdb
>>> on NMI already set (you’d still want to boot -k) — you can grab it at
>>> https://us-east.manta.joyent.com/jbk/OS-7079/platform-20180719T001516Z.{iso,tgz,usb.bz2}
>>>
>>> If you do, it’d be interesting to see ::cpustack on each core looks like.
>>>
>>>
>>> From: Youzhong Yang <[email protected]> <[email protected]>
>>> Reply: Youzhong Yang <[email protected]> <[email protected]>
>>> Date: August 12, 2018 at 11:58:48 PM
>>> To: Jason King <[email protected]> <[email protected]>
>>> Cc: [email protected] <[email protected]> <[email protected]>
>>> Subject: Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself
>>>
>>> I sent NMI, but it printed out a stack trace plus a message "no dump
>>> device" or something then rebooted.
>>> I tried -v on my old supermicro system,
>>> on the console I saw message about sd## devices, then it hung. The console
>>> still responded to keyboard, but just stayed that way forever.
>>>
>>> What change is needed to drop into kmdb when the OS receives NMI?
>>>
>>> On Mon, Aug 13, 2018 at 12:06 AM, Jason King <[email protected]> wrote:
>>>
>>>> Was that with boot -v? Are you able to send the system an NMI after it
>>>> hangs (or get the boot -v output up to the hang)?
>>>>
>>>> Prior to OS-7079, the system would start to startup the next CPU before
>>>> it had completely finished initializing the ‘current’ CPU (which could
>>>> deadlock depending on which CPU obtained a particular lock first), the
>>>> change makes it wait until the current CPU is finished starting up before
>>>> proceeding to the next CPU.
>>>>
>>>> It’s certainly possible it could have revealed another bug — OS-7079
>>>> itself was introduced almost 10 years ago, but didn’t seem to be easy to
>>>> trigger until recent CPUs.
>>>>
>>>>
>>>> From: Youzhong Yang <[email protected]> <[email protected]>
>>>> Reply: [email protected] <[email protected]> <[email protected]>
>>>> Date: August 12, 2018 at 10:46:05 PM
>>>> To: [email protected] <[email protected]> <[email protected]>
>>>> Subject: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself
>>>>
>>>> Today I built a smartos image (with all git repos synced to master) and
>>>> rebooted the host with that image. It hung after the banner message + one
>>>> more line about power management or something.
>>>>
>>>> Then I reverted OS-7079, built an image, rebooted, it worked perfectly.
>>>>
>>>> So does it mean OS-7079 fixed one issue, but caused another? My host is
>>>> an old Supermicro X8DAH, Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. Tomorrow I
>>>> will try on a new all NVMe system and see if it works.
>>>>
>>>> Thanks.
-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
Modify Your Subscription: https://www.listbox.com/member/?member_id=25769125
Powered by Listbox: https://www.listbox.com
