That is strange. The same git repo that built that image was the one that was pushed to Gerrit and merged with master (i.e. it is the _same_ commit). Looking at the current master (as of a few minutes ago), the change doesn’t appear to have been stepped on. It’s a very small change: it just moves the setting of a bit mask indicating the CPU has finished starting up to the last thing in the per-cpu startup thread (minus some diagnostic messages after startup and a call to thread_exit()).
Did you try doing a ‘gmake clobber’ in your SmartOS repo before building? Unfortunately, incremental builds (even just rebuilding illumos-joyent) don’t always work and can sometimes cause strange behavior.

In the meantime, I’ll try doing a full build of the current master and installing it on my server here at home. It was very good at tripping the bug in OS-7079, so I’ll see if I can get it to hang (though it’ll take a bit to do a full build).

From: Youzhong Yang <youzh...@gmail.com>
Reply: Youzhong Yang <youzh...@gmail.com>
Date: August 13, 2018 at 12:27:52 AM
To: Jason King <jason.brian.k...@gmail.com>
Cc: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject: Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself

So your image booted up. Interesting ... Maybe something else messed up your fix? Anyway, I am now building my image and will see what I can get from ::cpustack.

On Mon, Aug 13, 2018 at 1:14 AM, Jason King <jason.brian.k...@gmail.com> wrote:

Doh.. the problems of it being late :) .. there should be a ‘public’ in there. Try https://us-east.manta.joyent.com/jbk/public/OS-7079/platform-20180719T001516Z.iso

From: Youzhong Yang <youzh...@gmail.com>
Reply: Youzhong Yang <youzh...@gmail.com>
Date: August 13, 2018 at 12:12:52 AM
To: Jason King <jason.brian.k...@gmail.com>
Cc: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject: Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself

I got this:

{"code":"ResourceNotFound","message":"/jbk/OS-7079/platform-20180719T001516Z.iso does not exist"}

In our /etc/system, I have

set pcplusmp:apic_panic_on_nmi=1
set apix:apic_panic_on_nmi=1

If I set them to 0, and boot with -k, an NMI should drop into kmdb, right? I will build an image now and test.

On Mon, Aug 13, 2018 at 1:04 AM, Jason King <jason.brian.k...@gmail.com> wrote:

There’s a couple of ways: you can boot -kd and set a breakpoint to set it.
You can also set it in etc/system in the proto area when building an image. If you want, I do have an image of 20180719 w/ OS-7079 applied and kmdb on NMI already set (you’d still want to boot -k). You can grab it at https://us-east.manta.joyent.com/jbk/OS-7079/platform-20180719T001516Z.{iso,tgz,usb.bz2}

If you do, it’d be interesting to see what ::cpustack on each core looks like.

From: Youzhong Yang <youzh...@gmail.com>
Reply: Youzhong Yang <youzh...@gmail.com>
Date: August 12, 2018 at 11:58:48 PM
To: Jason King <jason.brian.k...@gmail.com>
Cc: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject: Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself

I sent an NMI, but it printed out a stack trace plus a message "no dump device" or something, then rebooted.

I tried -v on my old Supermicro system. On the console I saw messages about sd## devices, then it hung. The console still responded to the keyboard, but just stayed that way forever.

What change is needed to drop into kmdb when the OS receives an NMI?

On Mon, Aug 13, 2018 at 12:06 AM, Jason King <jason.brian.k...@gmail.com> wrote:

Was that with boot -v? Are you able to send the system an NMI after it hangs (or get the boot -v output up to the hang)?

Prior to OS-7079, the system would begin starting the next CPU before it had completely finished initializing the ‘current’ CPU (which could deadlock depending on which CPU obtained a particular lock first). The change makes it wait until the current CPU is finished starting up before proceeding to the next CPU. It’s certainly possible it could have revealed another bug. The bug behind OS-7079 was introduced almost 10 years ago, but didn’t seem to be easy to trigger until recent CPUs.
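The ordering described above can be sketched in user space. This is only an analogy using Python threads, not the illumos code: the names start_all_cpus, cpu_startup, and ready_mask are made up for illustration, and the real mp_startup_common logic is in C and considerably more involved. Each simulated CPU sets its bit in a mask as the last step of its startup, and the boot CPU waits for that bit before starting the next one, so no two CPUs are ever mid-initialization at the same time.

```python
import threading


def start_all_cpus(ncpu):
    """Simulate a boot CPU (CPU 0) bringing up CPUs 1..ncpu-1 one at a time.

    Returns (ready_mask, events), where events is the ordered startup log.
    All names here are hypothetical; this is not the illumos implementation.
    """
    ready_mask = 0                    # bit n set => CPU n finished starting up
    cv = threading.Condition()        # protects ready_mask
    events = []                       # ordered log of (cpu, step)
    events_lock = threading.Lock()    # protects events

    def log(cpu, step):
        with events_lock:
            events.append((cpu, step))

    def cpu_startup(cpu):
        nonlocal ready_mask
        log(cpu, "init-begin")
        # Per-CPU initialization that may take shared locks would go here.
        log(cpu, "init-end")
        with cv:
            # Publish "ready" only after all initialization is done.
            ready_mask |= 1 << cpu
            cv.notify_all()

    for cpu in range(1, ncpu):
        t = threading.Thread(target=cpu_startup, args=(cpu,))
        t.start()
        with cv:
            # Do not start the next CPU until this one reports ready.
            cv.wait_for(lambda: ready_mask & (1 << cpu))
        t.join()

    return ready_mask, events
```

If the ready bit were set before the "init-end" step instead, the boot CPU could start the next CPU while the current one was still initializing, which is the kind of overlap the message above describes.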
From: Youzhong Yang <youzh...@gmail.com>
Reply: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Date: August 12, 2018 at 10:46:05 PM
To: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself

Today I built a smartos image (with all git repos synced to master) and rebooted the host with that image. It hung after the banner message + one more line about power management or something.

Then I reverted OS-7079, built an image, rebooted, it worked perfectly.

So does it mean OS-7079 fixed one issue, but caused another?

My host is an old Supermicro X8DAH, Intel(R) Xeon(R) CPU X5570 @ 2.93GHz.

Tomorrow I will try on a new all NVMe system and see if it works.

Thanks.
------------------------------------------- smartos-discuss Archives: https://www.listbox.com/member/archive/184463/=now Modify Your Subscription: https://www.listbox.com/member/?member_id=25769125 Powered by Listbox: https://www.listbox.com