Hello,

We had similar crashes a little while back, also on an X10 board with a v4 chip.

Some v4 chips have a microcode bug, which KVM triggers. It can be fixed by 
installing the latest BIOS/firmware from Supermicro. If you’re running the 
X10QBi board (which it seems you are), a new BIOS/firmware release became 
available in October.

I originally raised the LX bug OS-5598 on IRC (then GitHub). We haven’t had 
problems with it since updating to a platform image that is older than the 
one you are currently running, though updating your platform image might 
still be a good idea.

It might be worth checking that you are running the latest BIOS/firmware, just 
to be sure you’re not hitting the v4 microcode bug before you spend too much 
time trying to track down what might be a red herring.
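
As a rough sketch of that check (assuming smbios(1M) is on the host, as it 
normally is on SmartOS; the function name is just mine for illustration), 
something like this should print the BIOS record and the running platform 
image:

```shell
#!/bin/sh
# Sketch only: show_firmware_info is a made-up helper name, not a SmartOS tool.
show_firmware_info() {
    # SMBIOS structure type 0 is "BIOS Information" (vendor, version, release date).
    if command -v smbios >/dev/null 2>&1; then
        smbios -t 0 || echo "smbios failed to read the BIOS record"
    else
        echo "smbios not found (is this an illumos/SmartOS host?)"
    fi
    # On SmartOS, uname -v should report the running platform image stamp.
    echo "platform image: $(uname -v)"
}

show_firmware_info
```

Compare the reported BIOS version and release date against the latest release 
Supermicro lists for your board.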

That said, I’m sure somebody at Joyent would still like to see the crash dump.
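
If you want a quick look yourself first, here is a minimal sketch. The helper 
name and the target directory are my own placeholders; the savecore command is 
the one from your log with an explicit output directory added. It assumes the 
expanded files are named unix.0 and vmcore.0, and that you have room for them, 
which for a 68GB compressed dump is a lot:

```shell
#!/bin/sh
# Sketch only: summarize_dump is a hypothetical helper, not a SmartOS tool.
# Expand the dump first (needs plenty of free space), for example:
#   savecore -vf /var/crash/volatile/vmdump.0 /path/with/space
summarize_dump() {
    dir=$1
    if [ ! -f "$dir/unix.0" ] || [ ! -f "$dir/vmcore.0" ]; then
        echo "no unix.0/vmcore.0 in $dir" >&2
        return 1
    fi
    if ! command -v mdb >/dev/null 2>&1; then
        echo "mdb not found" >&2
        return 1
    fi
    # ::status prints the panic message, ::stack the panicking thread's stack.
    printf '::status\n::stack\n' | mdb "$dir/unix.0" "$dir/vmcore.0"
}
```

Usage would be something like `summarize_dump /path/with/space`. The ::status 
output and the stack are usually the first things people ask for.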

Adam

> On 14 May 2017, at 10:20, Dr. Angelo Roussos <[email protected]> wrote:
> 
>  
> Good morning,
>  
> One of our newer hosts rebooted earlier this morning - before this, the host 
> had been up for about 5 months and functioning well.
>  
> Details as to hardware and running image are below, but it is currently 
> hosting 80 guests (a relatively small number for the hardware config of 
> the host), being a mix of KVM/OS/LX guests.
>  
> The host server is installed with all-SSDs, in a RAIDZ2 configuration, spread 
> equally across 2 vdevs.
>  
> We believe the main issue may be related to this - in SUMMARY:
>  
> ------------------------------
>  
> 2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot 
> after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58 
> occurred in module "genunix" due to a NULL pointer dereference
> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System dump 
> time: Sun May 14 03:26:31 2017
> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving 
> compressed system crash dump in /var/crash/volatile/vmdump.0
>  
> 2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress the 
> crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'
>  
> ------------------------------
>  
>  
> DETAILS ARE:
>  
> SYSINFO output:
>  
> [root@c10 ~]# sysinfo
> {
>   "Live Image": "20161013T025521Z",
>   "System Type": "SunOS",
>   "Boot Time": "1494737104",
>   "SDC Version": "7.0",
>   "Manufacturer": "Supermicro",
>   "Product": "SYS-4048B-TR4FT",
>   "Serial Number": "S186693X6715658",
>   "SKU Number": "072615D9",
>   "HW Version": "123456789",
>   "HW Family": "SMC X10",
>   "Setup": "false",
>   "VM Capable": true,
>   "CPU Type": "Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz",
>   "CPU Virtualization": "vmx",
>   "CPU Physical Cores": 4,
>   "UUID": "00000000-0000-0000-0000-0cc47aa33714",
>   "Hostname": "c10",
>   "CPU Total Cores": 128,
>   "MiB of Memory": "1572742",
>   "Zpool": "zones",
>   "Zpool Disks": 
> "c0t5002538C401EDE5Ed0,c0t5002538C401EDE64d0,c0t5002538C401EDE65d0,c0t5002538C401EDE69d0,c0t5002538C401EE05Bd0,c0t5002538C4027D5C6d0,c0t5002538C402BD9EBd0,c0t5002538C402BD9EEd0,c0t5002538C402BD9F5d0,c0t5002538C402BD9F9d0,c0t5002538C402BD9FCd0,c0t5002538C402BDA00d0,c0t5002538C402BDAC6d0,c0t5002538C40599D5Cd0,c0t5002538C40599EDEd0,c0t5002538C40599EE8d0",
>   "Zpool Profile": "raidz2",
>   "Zpool Creation": 1477475020,
>   "Zpool Size in GiB": 10073,
>   "Disks": {
>     "c0t5002538C401EDE5Ed0": {"Size in GB": 960},
>     "c0t5002538C401EDE64d0": {"Size in GB": 960},
>     "c0t5002538C401EDE65d0": {"Size in GB": 960},
>     "c0t5002538C401EDE69d0": {"Size in GB": 960},
>     "c0t5002538C401EE05Bd0": {"Size in GB": 960},
>     "c0t5002538C4027D5C6d0": {"Size in GB": 960},
>     "c0t5002538C402BD9EBd0": {"Size in GB": 960},
>     "c0t5002538C402BD9EEd0": {"Size in GB": 960},
>     "c0t5002538C402BD9F5d0": {"Size in GB": 960},
>     "c0t5002538C402BD9F9d0": {"Size in GB": 960},
>     "c0t5002538C402BD9FCd0": {"Size in GB": 960},
>     "c0t5002538C402BDA00d0": {"Size in GB": 960},
>     "c0t5002538C402BDAC6d0": {"Size in GB": 960},
>     "c0t5002538C40599D5Cd0": {"Size in GB": 960},
>     "c0t5002538C40599EDEd0": {"Size in GB": 960},
>     "c0t5002538C40599EE8d0": {"Size in GB": 960}
>   },
>   "Boot Parameters": {
>     "smartos": "true",
>     "root_shadow": "xxxxxxxxxxx"
>   },
>   "Network Interfaces": {
>     "ixgbe0": {"MAC Address": "0c:c4:7a:a3:37:14", "ip4addr": "xxxxxxxxxxx", "Link Status": "up", "NIC Names": ["admin"]},
>     "ixgbe1": {"MAC Address": "0c:c4:7a:a3:37:15", "ip4addr": "", "Link Status": "up", "NIC Names": []}
>   },
>   "Virtual Network Interfaces": {
>   },
>   "Link Aggregations": {
>  
>  
>  
> On initial login on c10, we see:
>  
> 05:27:43 UTC 2017#012PLATFORM: SYS-4048B-TR4FT, CSN: S186693X6715658, 
> HOSTNAME: c10#012SOURCE: fmd-self-diagnosis, REV: 1.0#012EVENT-ID: 
> a916f701-ad5f-4c05-c134-8968a449677d#012DESC: A Solaris Fault Manager 
> component has experienced an error that required the module to be disabled.  
> Refer to http://illumos.org/msg/FMD-8000-2K for more information.#012AUTO-RESPONSE: 
> The module has been disabled.  Events destined for the module will be saved 
> for manual diagnosis.#012IMPACT: Automated diagnosis and response for 
> subsequent events associated with this module will not occur.#012REC-ACTION: 
> Use fmdump -v -u <EVENT-ID> to locate the module.  Use fmadm reset <module> 
> to reset the module.
>  
>  
> [root@c10 /var/log]# svcs -xv svc:/site/default_init:default
> svc:/site/default_init:default (Copy init to /etc/default/init)
> State: maintenance since May 14, 2017 04:45:56 AM UTC
> Reason: Start method failed repeatedly, last exited with status 2.
>    See: http://illumos.org/msg/SMF-8000-KS
>    See: /var/svc/log/site-default_init:default.log
> Impact: This service is not running.
>  
>  
> [root@c10 /var/log]# fmadm config
> MODULE                   VERSION STATUS  DESCRIPTION
> cpumem-retire            1.1     active  CPU/Memory Retire Agent
> disk-lights              1.0     active  Disk Lights Agent
> disk-transport           1.0     active  Disk Transport Agent
> eft                      1.16    active  eft diagnosis engine
> ext-event-transport      0.2     active  External FM event transport
> fabric-xlate             1.0     active  Fabric Ereport Translater
> fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
> io-retire                2.0     active  I/O Retire Agent
> sensor-transport         1.1     active  Sensor Transport Agent
> ses-log-transport        1.0     active  SES Log Transport Agent
> software-diagnosis       0.1     failed  Software Diagnosis engine
> software-response        0.1     active  Software Response Agent
> sysevent-transport       1.0     active  SysEvent Transport Agent
> syslog-msgs              1.1     active  Syslog Messaging Agent
> zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
> zfs-retire               1.0     active  ZFS Retire Agent
>  
> So:
>  
> software-diagnosis       0.1     failed  Software Diagnosis engine
>  
>  
> [root@c10 /var/log]# fmadm faulty
> --------------- ------------------------------------  -------------- ---------
> TIME            EVENT-ID                              MSG-ID         SEVERITY
> --------------- ------------------------------------  -------------- ---------
> May 14 05:27:43 a916f701-ad5f-4c05-c134-8968a449677d  FMD-8000-2K    Minor
>  
> Host        : c10
> Platform    : SYS-4048B-TR4FT   Chassis_id  : S186693X6715658
> Product_sn  :
>  
> Fault class : defect.sunos.fmd.module
> Affects     : fmd:///module/software-diagnosis
>                   faulted and taken out of service
> FRU         : None
>                   faulty
>  
> Description : A Solaris Fault Manager component has experienced an error that
>               required the module to be disabled.  Refer to
>               http://illumos.org/msg/FMD-8000-2K for more information.
>  
> Response    : The module has been disabled.  Events destined for the module
>               will be saved for manual diagnosis.
>  
> Impact      : Automated diagnosis and response for subsequent events associated
>               with this module will not occur.
>  
> Action      : Use fmdump -v -u <EVENT-ID> to locate the module.  Use fmadm
>               reset <module> to reset the module.
>  
>  
> So it looks like the fmadm message is NOT related to the primary cause of the 
> issue: the event was recorded AFTER the reboot.
>  
>  
> Looking at /var/log/auth.log:
>  
> 2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot 
> after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58 
> occurred in module "genunix" due to a NULL pointer dereference
> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System dump 
> time: Sun May 14 03:26:31 2017
> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving 
> compressed system crash dump in /var/crash/volatile/vmdump.0
>  
> 2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress the 
> crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'
>  
>  
> Nothing in dmesg, EXCEPT zsched messages - as in below:
>  
> .....
> 2017-05-14T07:11:31.792661+00:00 c10 genunix: [ID 470503 kern.warning] 
> WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
> 2017-05-14T07:11:31.792671+00:00 c10 genunix: [ID 470503 kern.warning] 
> WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
> 2017-05-14T07:11:31.873184+00:00 c10 genunix: [ID 470503 kern.warning] 
> WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
> 2017-05-14T07:11:31.873214+00:00 c10 genunix: [ID 470503 kern.warning] 
> WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
> 2017-05-14T07:11:31.894092+00:00 c10 genunix: [ID 470503 kern.warning] 
> WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)
> 2017-05-14T07:11:31.894123+00:00 c10 genunix: [ID 470503 kern.warning] 
> WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)
> 2017-05-14T07:11:31.894133+00:00 c10 genunix: [ID 470503 kern.warning] 
> WARNING: Sorry, no swap space to grow stack for pid 41140 (zsched)
> 2017-05-14T07:11:39.760880+00:00 c10 mac: [ID 736570 kern.info] 
> NOTICE: vnic1272 unregistered
>  
>  
> BUT zsched messages AND PANIC message in auth.log seem to both be related to 
> lwp issues that have been fixed:
>  
> https://github.com/joyent/illumos-joyent/issues/127
> https://smartos.org/bugview/OS-4188
> https://smartos.org/bugview/OS-5598
>  
>  
> We have a core dump available - all 68GB (compressed) of it..
>  
> But it seems to us that the issue is related to the above two 'fixed' (or 
> not?) issues. Or might this be new?
>  
> Are we going down the incorrect path?
>  
> Any feedback would be greatly appreciated,
>  
> Thanks,
>  
> Angelo.
>  
>  
>  




-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=25769125&id_secret=25769125-7688e9fb
Powered by Listbox: http://www.listbox.com
