That is strange: the same git repo that built that image was the one that was 
pushed to Gerrit and merged with master (i.e. it is the _same_ commit).  
Looking at the current master (as of a few minutes ago), the change doesn’t 
appear to have been stepped on. It’s a very small change: it just moves the 
setting of the bit mask indicating that the CPU has finished starting up to 
the last step in the per-CPU startup thread (apart from some diagnostic 
messages after startup and a call to thread_exit()).

Did you try doing a ‘gmake clobber’ in your SmartOS repo before building?  
Unfortunately, incremental builds (even just rebuilding illumos-joyent) don’t 
always work and can sometimes cause strange behavior.

In the meantime, I’ll try doing a full build of the current master and 
installing it on my server here at home — it was very good at tripping the bug 
in OS-7079, so I’ll see if I can get it to hang (though it’ll take a bit to do 
a full build).


From: Youzhong Yang <youzh...@gmail.com>
Reply: Youzhong Yang <youzh...@gmail.com>
Date: August 13, 2018 at 12:27:52 AM
To: Jason King <jason.brian.k...@gmail.com>
Cc: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject:  Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common 
races itself  

So your image booted up. Interesting ... Maybe something else messed up your 
fix?
Anyway, I am now building my image and will see what I can get from ::cpustack.

On Mon, Aug 13, 2018 at 1:14 AM, Jason King <jason.brian.k...@gmail.com> wrote:
Doh.. the problems of it being late :) .. there should be a ‘public’ in there. 

Try

https://us-east.manta.joyent.com/jbk/public/OS-7079/platform-20180719T001516Z.iso


From: Youzhong Yang <youzh...@gmail.com>
Reply: Youzhong Yang <youzh...@gmail.com>
Date: August 13, 2018 at 12:12:52 AM
To: Jason King <jason.brian.k...@gmail.com>
Cc: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject:  Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common 
races itself

I got this:

{"code":"ResourceNotFound","message":"/jbk/OS-7079/platform-20180719T001516Z.iso does not exist"}

In our /etc/system, I have:
set pcplusmp:apic_panic_on_nmi=1
set apix:apic_panic_on_nmi=1

If I set them to 0, and boot with -k, an NMI should drop into kmdb, right? I 
will build an image now and test.


On Mon, Aug 13, 2018 at 1:04 AM, Jason King <jason.brian.k...@gmail.com> wrote:
There’s a couple of ways — you can boot -kd and set a breakpoint to set it.  
You can also set it in etc/system in the proto area when building an image.

If you want, I do have an image of 20180719 w/ OS-7079 applied and kmdb on NMI 
already set (you’d still want to boot -k)  — you can grab it at
https://us-east.manta.joyent.com/jbk/OS-7079/platform-20180719T001516Z.{iso,tgz,usb.bz2}

If you do, it’d be interesting to see what ::cpustack on each core looks like.


From: Youzhong Yang <youzh...@gmail.com>
Reply: Youzhong Yang <youzh...@gmail.com>
Date: August 12, 2018 at 11:58:48 PM
To: Jason King <jason.brian.k...@gmail.com>
Cc: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject:  Re: [smartos-discuss] still hang at boot - OS-7079 mp_startup_common 
races itself

I sent an NMI, but it printed out a stack trace plus a message "no dump device" 
or something, then rebooted. I tried -v on my old Supermicro system; on the 
console I saw messages about sd## devices, then it hung. The console still 
responded to the keyboard, but just stayed that way forever.

What change is needed to drop into kmdb when the OS receives an NMI?

On Mon, Aug 13, 2018 at 12:06 AM, Jason King <jason.brian.k...@gmail.com> wrote:
Was that with boot -v?  Are you able to send the system an NMI after it hangs 
(or get the boot -v output up to the hang)?

Prior to OS-7079, the system would begin starting up the next CPU before it had 
completely finished initializing the ‘current’ CPU (which could deadlock 
depending on which CPU obtained a particular lock first). The change makes it 
wait until the current CPU has finished starting up before proceeding to the 
next CPU.

It’s certainly possible that the fix has revealed another bug. The bug behind 
OS-7079 was itself introduced almost 10 years ago, but didn’t seem to be easy 
to trigger until recent CPUs.


From: Youzhong Yang <youzh...@gmail.com>
Reply: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Date: August 12, 2018 at 10:46:05 PM
To: smartos-discuss@lists.smartos.org <smartos-discuss@lists.smartos.org>
Subject:  [smartos-discuss] still hang at boot - OS-7079 mp_startup_common 
races itself

Today I built a smartos image (with all git repos synced to master) and 
rebooted the host with that image. It hung after the banner message + one more 
line about power management or something.

Then I reverted OS-7079, built an image, and rebooted; it worked perfectly.

So does it mean OS-7079 fixed one issue, but caused another? My host is an old 
Supermicro X8DAH, Intel(R) Xeon(R) CPU X5570  @ 2.93GHz. Tomorrow I will try on 
a new all NVMe system and see if it works.

Thanks.






-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
Modify Your Subscription: https://www.listbox.com/member/?member_id=25769125
Powered by Listbox: https://www.listbox.com