I also built master from this morning (745f0bf63f3e20e5c1f0b83d85eaa4e99efcf441) and so far it has been OK.
If you do need to do kmdb on NMI, the easiest way I’ve found is something similar to this diff (in smartos-live):

diff --git a/overlay/generic/etc/system b/overlay/generic/etc/system
index 62b96ef..2fded09 100644
--- a/overlay/generic/etc/system
+++ b/overlay/generic/etc/system
@@ -113,8 +113,8 @@ set ip:ip_squeue_fanout=1
 *
 * Machines should take a crash dump and reboot when receiving an NMI
 *
-set pcplusmp:apic_panic_on_nmi=1
-set apix:apic_panic_on_nmi=1
+set pcplusmp:apic_kmdb_on_nmi=1
+set apix:apic_kmdb_on_nmi=1
 
 *
 * Don't use multi-threaded fast crash dump or a high compression level

You can boot -k and do some kmdb tricks to set a breakpoint at the right spot and set those in kmdb (you need to set a breakpoint at some point after the modules are loaded, but before it hits the hang), but I found it was a bigger hassle to get it right than just producing a special-purpose image (I just keep a branch with just the above change that I can rebase/checkout/etc. as needed), though maybe those with better mdb-fu than me have better luck with that approach.

From: Youzhong Yang <youzh...@gmail.com>
Reply: Youzhong Yang <youzh...@gmail.com>
Date: August 13, 2018 at 9:29:07 AM
To: Jason King <jason.brian.k...@gmail.com>
Cc: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject: Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself

Hi Jason,

The image that had the issue was a full build, but last night I ran git reset --hard, git pull, etc., and the image I got didn't reproduce the hang. So now I am starting all over again to see if I can reproduce it.

Thanks.

On Mon, Aug 13, 2018 at 10:20 AM, Jason King <jason.brian.k...@gmail.com> wrote:

That is strange: the same git repo that built that image was the one that was pushed to Gerrit and merged with master (i.e. it is the _same_ commit).
Looking at the current master (as of a few minutes ago), the change doesn’t appear to be stepped on. It’s a very small change: it just moves the setting of a bit mask indicating the CPU has finished starting up to the last thing in the per-CPU startup thread (minus some diagnostic messages after startup and a call to thread_exit()).

Did you try doing a ‘gmake clobber’ in your SmartOS repo before building? Unfortunately, incremental builds (even just rebuilding illumos-joyent) don’t always work and can sometimes cause strange behavior.

In the meantime, I’ll try doing a full build of the current master and installing it on my server here at home. It was very good at tripping the bug in OS-7079, so I’ll see if I can get it to hang (though it’ll take a bit to do a full build).

From: Youzhong Yang <youzh...@gmail.com>
Reply: Youzhong Yang <youzh...@gmail.com>
Date: August 13, 2018 at 12:27:52 AM
To: Jason King <jason.brian.k...@gmail.com>
Cc: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject: Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself

So your image booted up. Interesting ... Maybe something else messed up your fix?

Anyway, I am now building my image and will see what I can get from ::cpustack.

On Mon, Aug 13, 2018 at 1:14 AM, Jason King <jason.brian.k...@gmail.com> wrote:

Doh.. the problems of it being late :) .. there should be a ‘public’ in there.
Try https://us-east.manta.joyent.com/jbk/public/OS-7079/platform-20180719T001516Z.iso

From: Youzhong Yang <youzh...@gmail.com>
Reply: Youzhong Yang <youzh...@gmail.com>
Date: August 13, 2018 at 12:12:52 AM
To: Jason King <jason.brian.k...@gmail.com>
Cc: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject: Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself

I got this:

{"code":"ResourceNotFound","message":"/jbk/OS-7079/platform-20180719T001516Z.iso does not exist"}

In our /etc/system, I have

set pcplusmp:apic_panic_on_nmi=1
set apix:apic_panic_on_nmi=1

If I set them to 0 and boot with -k, an NMI should drop into kmdb, right?

I will build an image now and test.

On Mon, Aug 13, 2018 at 1:04 AM, Jason King <jason.brian.k...@gmail.com> wrote:

There are a couple of ways: you can boot -kd and set a breakpoint to set it. You can also set it in etc/system in the proto area when building an image.

If you want, I do have an image of 20180719 w/ OS-7079 applied and kmdb on NMI already set (you’d still want to boot -k). You can grab it at https://us-east.manta.joyent.com/jbk/OS-7079/platform-20180719T001516Z.{iso,tgz,usb.bz2}

If you do, it’d be interesting to see what ::cpustack on each core looks like.

From: Youzhong Yang <youzh...@gmail.com>
Reply: Youzhong Yang <youzh...@gmail.com>
Date: August 12, 2018 at 11:58:48 PM
To: Jason King <jason.brian.k...@gmail.com>
Cc: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject: Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself

I sent an NMI, but it printed out a stack trace plus a message "no dump device" or something, then rebooted.

I tried -v on my old Supermicro system. On the console I saw messages about sd## devices, then it hung. The console still responded to the keyboard, but just stayed that way forever.

What change is needed to drop into kmdb when the OS receives an NMI?
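For readers following the NMI question, here is a minimal C sketch of how the two tunables discussed in this thread might interact. It is a toy userland model, not the illumos source. The function and enum names are hypothetical, and the ordering (kmdb-on-NMI checked first and only honored when a debugger was loaded with boot -k, then panic-on-NMI, otherwise just a message) is an assumption that is consistent with the advice in the thread but is not quoted from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* What the modeled NMI handler ends up doing (hypothetical names). */
typedef enum {
	NMI_ENTER_KMDB,	/* drop into the kernel debugger */
	NMI_PANIC,	/* panic: crash dump attempt, then reboot */
	NMI_LOG_ONLY	/* report the NMI and carry on */
} nmi_action_t;

/*
 * Toy model of the decision, NOT the real handler.
 * ASSUMPTION: kmdb_on_nmi wins only if a debugger is loaded (boot -k);
 * otherwise panic_on_nmi applies; otherwise the NMI is only logged.
 */
nmi_action_t
nmi_action(bool kmdb_on_nmi, bool panic_on_nmi, bool debugger_loaded)
{
	if (kmdb_on_nmi && debugger_loaded)
		return (NMI_ENTER_KMDB);
	if (panic_on_nmi)
		return (NMI_PANIC);
	return (NMI_LOG_ONLY);
}

/* Exercises the configurations mentioned in the thread. */
void
nmi_model_selfcheck(void)
{
	/* apic_kmdb_on_nmi=1 and boot -k: the debugger is entered. */
	assert(nmi_action(true, false, true) == NMI_ENTER_KMDB);
	/* apic_panic_on_nmi=1 only: stack trace, dump attempt, reboot. */
	assert(nmi_action(false, true, true) == NMI_PANIC);
	/* Both cleared, even with boot -k: nothing enters kmdb. */
	assert(nmi_action(false, false, true) == NMI_LOG_ONLY);
}
```

Under this assumed model, setting apic_panic_on_nmi to 0 on its own would not be enough to reach kmdb; apic_kmdb_on_nmi would also have to be set, which matches the diff posted in the thread.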
On Mon, Aug 13, 2018 at 12:06 AM, Jason King <jason.brian.k...@gmail.com> wrote:

Was that with boot -v? Are you able to send the system an NMI after it hangs (or get the boot -v output up to the hang)?

Prior to OS-7079, the system would begin starting the next CPU before it had completely finished initializing the ‘current’ CPU (which could deadlock depending on which CPU obtained a particular lock first); the change makes it wait until the current CPU is finished starting up before proceeding to the next CPU.

It’s certainly possible it could have revealed another bug. The bug fixed by OS-7079 was itself introduced almost 10 years ago, but didn’t seem to be easy to trigger until recent CPUs.

From: Youzhong Yang <youzh...@gmail.com>
Reply: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Date: August 12, 2018 at 10:46:05 PM
To: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common races itself

Today I built a SmartOS image (with all git repos synced to master) and rebooted the host with that image. It hung after the banner message plus one more line about power management or something.

Then I reverted OS-7079, built an image, and rebooted, and it worked perfectly.

So does it mean OS-7079 fixed one issue, but caused another?

My host is an old Supermicro X8DAH, Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. Tomorrow I will try on a new all-NVMe system and see if it works.

Thanks.
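The ordering change Jason describes above for OS-7079 can be illustrated with a small C sketch. This is a toy model under stated assumptions, not the mp_startup_common code: the step names, the two example orderings, and the helper function are all hypothetical. It only counts how many per-CPU initialization steps would still be outstanding at the moment the CPU is marked as started, which is the window in which the next CPU could begin starting concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Steps in a per-CPU startup thread (hypothetical, for illustration). */
typedef enum {
	STEP_INIT,		/* some piece of per-CPU initialization */
	STEP_MARK_READY		/* set this CPU's bit in the "started" mask */
} startup_step_t;

/* Assumed pre-fix shape: the bit is set before initialization is done. */
const startup_step_t old_order[] = {
	STEP_INIT, STEP_MARK_READY, STEP_INIT, STEP_INIT
};

/* Assumed post-fix shape: setting the bit is the last step. */
const startup_step_t new_order[] = {
	STEP_INIT, STEP_INIT, STEP_INIT, STEP_MARK_READY
};

/*
 * Count the init steps that run after the CPU is marked ready. Any such
 * step can overlap with the next CPU's startup. Returns 0 if the CPU is
 * never marked ready, since then the next CPU is never started.
 */
size_t
init_steps_after_ready(const startup_step_t *steps, size_t n)
{
	size_t i, pending = 0;
	int ready = 0;

	assert(steps != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (steps[i] == STEP_MARK_READY)
			ready = 1;
		else if (ready)
			pending++;
	}
	return (pending);
}
```

In this model, init_steps_after_ready(old_order, 4) is nonzero, meaning two CPUs can be initializing at once, while init_steps_after_ready(new_order, 4) is zero, so startup is serialized. Whether the remaining hang reported in this thread is related to that ordering is not something the sketch can show.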
------------------------------------------- smartos-discuss Archives: https://www.listbox.com/member/archive/184463/=now Modify Your Subscription: https://www.listbox.com/member/?member_id=25769125 Powered by Listbox: https://www.listbox.com