Thanks very much, Adam. We'll schedule an upgrade based on your recommendations.
If you don't mind, I'll email you off-list about something related to this but outside the scope of the SmartOS list.

> On 14 May 2017, at 12:43, Adam Richmond-Gordon <[email protected]> wrote:
>
> Hello,
>
> We had similar crashes a little while back, also on an X10 board with a v4
> chip.
>
> Some v4 chips have a microcode bug which can be rectified by installing the
> latest BIOS/firmware from Supermicro - if you’re running the X10QBi board
> (which it seems you are), there was a new BIOS/firmware available in October.
> KVM triggers this microcode bug.
>
> The LX bug OS-5598 was initially raised by me on IRC (then GitHub), we didn’t
> suffer issues with that after updating to the platform image that is older
> than the one you are currently running (though updating your platform image
> might still be an idea).
>
> It might be worth checking that you are running the latest BIOS/firmware,
> just to be sure you’re not hitting the v4 microcode bug before you spend too
> much time trying to track down what might be a red herring.
>
> That said, I’m sure somebody at Joyent would still like to see the crash dump.
>
> Adam
>
>> On 14 May 2017, at 10:20, Dr. Angelo Roussos <[email protected]> wrote:
>>
>> Good morning,
>>
>> One of our newer hosts rebooted earlier this morning - before this, the host
>> had been up for about 5 months and functioning well.
>>
>> Details as to hardware and running image are below, but it is currently
>> hosting a relatively small number of guest instances - 80 guests (i.e. for
>> the hardware config of the host) - being a mix of KVM/OS/LX guests.
>>
>> The host server is installed with all-SSDs, in a RAIDZ2 configuration,
>> spread equally across 2 vdevs.
>>
>> We believe the main issue may be related to this - in SUMMARY:
>>
>> ------------------------------
>>
>> 2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58 occurred in module "genunix" due to a NULL pointer dereference
>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System dump time: Sun May 14 03:26:31 2017
>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving compressed system crash dump in /var/crash/volatile/vmdump.0
>>
>> 2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress the crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'
>>
>> ------------------------------
>>
>>
>> DETAILS ARE:
>>
>> SYSINFO output:
>>
>> [root@c10 ~]# sysinfo
>> {
>>   "Live Image": "20161013T025521Z",
>>   "System Type": "SunOS",
>>   "Boot Time": "1494737104",
>>   "SDC Version": "7.0",
>>   "Manufacturer": "Supermicro",
>>   "Product": "SYS-4048B-TR4FT",
>>   "Serial Number": "S186693X6715658",
>>   "SKU Number": "072615D9",
>>   "HW Version": "123456789",
>>   "HW Family": "SMC X10",
>>   "Setup": "false",
>>   "VM Capable": true,
>>   "CPU Type": "Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz",
>>   "CPU Virtualization": "vmx",
>>   "CPU Physical Cores": 4,
>>   "UUID": "00000000-0000-0000-0000-0cc47aa33714",
>>   "Hostname": "c10",
>>   "CPU Total Cores": 128,
>>   "MiB of Memory": "1572742",
>>   "Zpool": "zones",
>>   "Zpool Disks": "c0t5002538C401EDE5Ed0,c0t5002538C401EDE64d0,c0t5002538C401EDE65d0,c0t5002538C401EDE69d0,c0t5002538C401EE05Bd0,c0t5002538C4027D5C6d0,c0t5002538C402BD9EBd0,c0t5002538C402BD9EEd0,c0t5002538C402BD9F5d0,c0t5002538C402BD9F9d0,c0t5002538C402BD9FCd0,c0t5002538C402BDA00d0,c0t5002538C402BDAC6d0,c0t5002538C40599D5Cd0,c0t5002538C40599EDEd0,c0t5002538C40599EE8d0",
>>   "Zpool Profile": "raidz2",
>>   "Zpool Creation": 1477475020,
>>   "Zpool Size in GiB": 10073,
>>   "Disks": {
>>     "c0t5002538C401EDE5Ed0": {"Size in GB": 960},
>>     "c0t5002538C401EDE64d0": {"Size in GB": 960},
>>     "c0t5002538C401EDE65d0": {"Size in GB": 960},
>>     "c0t5002538C401EDE69d0": {"Size in GB": 960},
>>     "c0t5002538C401EE05Bd0": {"Size in GB": 960},
>>     "c0t5002538C4027D5C6d0": {"Size in GB": 960},
>>     "c0t5002538C402BD9EBd0": {"Size in GB": 960},
>>     "c0t5002538C402BD9EEd0": {"Size in GB": 960},
>>     "c0t5002538C402BD9F5d0": {"Size in GB": 960},
>>     "c0t5002538C402BD9F9d0": {"Size in GB": 960},
>>     "c0t5002538C402BD9FCd0": {"Size in GB": 960},
>>     "c0t5002538C402BDA00d0": {"Size in GB": 960},
>>     "c0t5002538C402BDAC6d0": {"Size in GB": 960},
>>     "c0t5002538C40599D5Cd0": {"Size in GB": 960},
>>     "c0t5002538C40599EDEd0": {"Size in GB": 960},
>>     "c0t5002538C40599EE8d0": {"Size in GB": 960}
>>   },
>>   "Boot Parameters": {
>>     "smartos": "true",
>>     "root_shadow": "xxxxxxxxxxx"
>>   },
>>   "Network Interfaces": {
>>     "ixgbe0": {"MAC Address": "0c:c4:7a:a3:37:14", "ip4addr": "xxxxxxxxxxx", "Link Status": "up", "NIC Names": ["admin"]},
>>     "ixgbe1": {"MAC Address": "0c:c4:7a:a3:37:15", "ip4addr": "", "Link Status": "up", "NIC Names": []}
>>   },
>>   "Virtual Network Interfaces": {
>>   },
>>   "Link Aggregations": {
>>
>>
>>
>> On initial login on c10, we see:
>>
>> 05:27:43 UTC 2017#012PLATFORM: SYS-4048B-TR4FT, CSN: S186693X6715658,
>> HOSTNAME: c10#012SOURCE: fmd-self-diagnosis, REV: 1.0#012EVENT-ID:
>> a916f701-ad5f-4c05-c134-8968a449677d#012DESC: A Solaris Fault Manager
>> component has experienced an error that required the module to be disabled.
>> Refer to http://illumos.org/msg/FMD-8000-2K for more
>> information.#012AUTO-RESPONSE: The module has been disabled. Events
>> destined for the module will be saved for manual diagnosis.#012IMPACT:
>> Automated diagnosis and response for subsequent events associated with this
>> module will not occur.#012REC-ACTION: Use fmdump -v -u <EVENT-ID> to locate
>> the module. Use fmadm reset <module> to reset the module.
>>
>>
>> [root@c10 /var/log]# svcs -xv svc:/site/default_init:default
>> svc:/site/default_init:default (Copy init to /etc/default/init)
>>  State: maintenance since May 14, 2017 04:45:56 AM UTC
>> Reason: Start method failed repeatedly, last exited with status 2.
>>    See: http://illumos.org/msg/SMF-8000-KS
>>    See: /var/svc/log/site-default_init:default.log
>> Impact: This service is not running.
>>
>>
>> [root@c10 /var/log]# fmadm config
>> MODULE                VERSION  STATUS  DESCRIPTION
>> cpumem-retire         1.1      active  CPU/Memory Retire Agent
>> disk-lights           1.0      active  Disk Lights Agent
>> disk-transport        1.0      active  Disk Transport Agent
>> eft                   1.16     active  eft diagnosis engine
>> ext-event-transport   0.2      active  External FM event transport
>> fabric-xlate          1.0      active  Fabric Ereport Translater
>> fmd-self-diagnosis    1.0      active  Fault Manager Self-Diagnosis
>> io-retire             2.0      active  I/O Retire Agent
>> sensor-transport      1.1      active  Sensor Transport Agent
>> ses-log-transport     1.0      active  SES Log Transport Agent
>> software-diagnosis    0.1      failed  Software Diagnosis engine
>> software-response     0.1      active  Software Response Agent
>> sysevent-transport    1.0      active  SysEvent Transport Agent
>> syslog-msgs           1.1      active  Syslog Messaging Agent
>> zfs-diagnosis         1.0      active  ZFS Diagnosis Engine
>> zfs-retire            1.0      active  ZFS Retire Agent
>>
>> So:
>>
>> software-diagnosis    0.1      failed  Software Diagnosis engine
>>
>>
>> [root@c10 /var/log]# fmadm faulty
>> --------------- ------------------------------------ -------------- ---------
>> TIME            EVENT-ID                             MSG-ID         SEVERITY
>> --------------- ------------------------------------ -------------- ---------
>> May 14 05:27:43 a916f701-ad5f-4c05-c134-8968a449677d FMD-8000-2K    Minor
>>
>> Host        : c10
>> Platform    : SYS-4048B-TR4FT   Chassis_id  : S186693X6715658
>> Product_sn  :
>>
>> Fault class : defect.sunos.fmd.module
>> Affects     : fmd:///module/software-diagnosis
>>                   faulted and taken out of service
>> FRU         : None
>>                   faulty
>>
>> Description : A Solaris Fault Manager component has experienced an error that
>>               required the module to be disabled. Refer to
>>               http://illumos.org/msg/FMD-8000-2K for more information.
>>
>> Response    : The module has been disabled. Events destined for the module
>>               will be saved for manual diagnosis.
>>
>> Impact      : Automated diagnosis and response for subsequent events
>>               associated with this module will not occur.
>>
>> Action      : Use fmdump -v -u <EVENT-ID> to locate the module. Use fmadm
>>               reset <module> to reset the module.
>>
>>
>> So it looks like the fmadm message is NOT related to the primary cause of
>> the issue... event recorded AFTER reboot..
>>
>>
>> Looking at /var/log/auth.log:
>>
>> 2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58 occurred in module "genunix" due to a NULL pointer dereference
>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System dump time: Sun May 14 03:26:31 2017
>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving compressed system crash dump in /var/crash/volatile/vmdump.0
>>
>> 2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress the crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'
>>
>>
>> Nothing in dmesg, EXCEPT zsched messages - as in below:
>>
>> .....
>> 2017-05-14T07:11:31.792661+00:00 c10 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
>> 2017-05-14T07:11:31.792671+00:00 c10 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
>> 2017-05-14T07:11:31.873184+00:00 c10 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
>> 2017-05-14T07:11:31.873214+00:00 c10 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
>> 2017-05-14T07:11:31.894092+00:00 c10 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)
>> 2017-05-14T07:11:31.894123+00:00 c10 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)
>> 2017-05-14T07:11:31.894133+00:00 c10 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 41140 (zsched)
>> 2017-05-14T07:11:39.760880+00:00 c10 mac: [ID 736570 kern.info] NOTICE: vnic1272 unregistered
>>
>>
>> BUT zsched messages AND PANIC message in auth.log seem to both be related to
>> lwp issues that have been fixed:
>>
>> https://github.com/joyent/illumos-joyent/issues/127
>> https://smartos.org/bugview/OS-4188
>> https://smartos.org/bugview/OS-5598
>>
>>
>> We have a core dump available - all 68GB (compressed) of it..
>>
>> But it seems to us that the issue seems related to the above two 'fixed' (or
>> not?) issues? Or might this be new?
>>
>> Are we going down the incorrect path?
>>
>> Any feedback would be greatly appreciated,
>>
>> Thanks,
>>
>> Angelo.
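
For anyone triaging similar reboots, the savecore "reboot after panic" line quoted from auth.log above can be split into its fields with a few lines of Python. This is only a rough sketch: the regular expression, the field names and the parse_panic helper are assumptions made for illustration and were written against the single line quoted in this thread, not against any documented log format.

```python
import re

# Hypothetical pattern for the savecore "reboot after panic" summary line,
# modelled only on the auth.log excerpt in this thread.
PANIC_RE = re.compile(
    r'reboot after panic: (?P<kind>[A-Z ]+): type=(?P<trap>[0-9a-f]+) '
    r'\((?P<desc>[^)]*)\) rp=(?P<rp>[0-9a-f]+) addr=(?P<addr>[0-9a-f]+) '
    r'occurred in module "(?P<module>[^"]+)"'
)


def parse_panic(line):
    """Return a dict of panic fields, or None if the line is not a panic summary."""
    match = PANIC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # addr is printed in hex; convert it so small offsets are easy to spot.
    fields["addr"] = int(fields["addr"], 16)
    return fields


# The panic line exactly as quoted from /var/log/auth.log on c10.
sample = (
    '2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] '
    'reboot after panic: BAD TRAP: type=e (#pf Page fault) '
    'rp=ffffd00b88b0b3d0 addr=58 occurred in module "genunix" '
    'due to a NULL pointer dereference'
)

info = parse_panic(sample)
print(info["kind"], info["module"], hex(info["addr"]))
```

Running the same helper over auth.log on several hosts would give a quick way to compare faulting modules and addresses before anyone opens the 68GB dump itself.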
