No problem! Hopefully that'll fix it. Yes, feel free :)
Sent from my iPhone

> On 15 May 2017, at 06:54, Dr. Angelo Roussos wrote:
>
> Thanks very much Adam.
>
> We'll schedule an upgrade based on your recommendations.
>
> If you don't mind, I'll email you off-list about something related to this
> but outside the scope of the SmartOS list.
>
>> On 14 May 2017, at 12:43, Adam Richmond-Gordon wrote:
>>
>> Hello,
>>
>> We had similar crashes a little while back, also on an X10 board with a v4
>> chip.
>>
>> Some v4 chips have a microcode bug which can be rectified by installing the
>> latest BIOS/firmware from Supermicro - if you’re running the X10QBi board
>> (which it seems you are), there was a new BIOS/firmware available in
>> October. KVM triggers this microcode bug.
>>
>> The LX bug OS-5598 was initially raised by me on IRC (then GitHub). We
>> didn’t suffer issues with that after updating to a platform image that is
>> older than the one you are currently running (though updating your platform
>> image might still be an idea).
>>
>> It might be worth checking that you are running the latest BIOS/firmware,
>> just to be sure you’re not hitting the v4 microcode bug before you spend too
>> much time trying to track down what might be a red herring.
>>
>> That said, I’m sure somebody at Joyent would still like to see the crash
>> dump.
>>
>> Adam
>>
>>> On 14 May 2017, at 10:20, Dr. Angelo Roussos wrote:
>>>
>>> Good morning,
>>>
>>> One of our newer hosts rebooted earlier this morning - before this, the
>>> host had been up for about 5 months and functioning well.
>>>
>>> Details as to hardware and running image are below, but it is currently
>>> hosting a relatively small number of guest instances for its hardware
>>> config - 80 guests - being a mix of KVM/OS/LX guests.
>>>
>>> The host server is installed with all-SSDs, in a RAIDZ2 configuration,
>>> spread equally across 2 vdevs.
>>>
>>> We believe the main issue may be related to this - in SUMMARY:
>>>
>>> ------------------------------
>>>
>>> 2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot
>>> after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58
>>> occurred in module "genunix" due to a NULL pointer dereference
>>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System
>>> dump time: Sun May 14 03:26:31 2017
>>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving
>>> compressed system crash dump in /var/crash/volatile/vmdump.0
>>>
>>> 2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress
>>> the crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'
>>>
>>> ------------------------------
>>>
>>>
>>> DETAILS ARE:
>>>
>>> SYSINFO output:
>>>
>>> [root@c10 ~]# sysinfo
>>> {
>>>   "Live Image": "20161013T025521Z",
>>>   "System Type": "SunOS",
>>>   "Boot Time": "1494737104",
>>>   "SDC Version": "7.0",
>>>   "Manufacturer": "Supermicro",
>>>   "Product": "SYS-4048B-TR4FT",
>>>   "Serial Number": "S186693X6715658",
>>>   "SKU Number": "072615D9",
>>>   "HW Version": "123456789",
>>>   "HW Family": "SMC X10",
>>>   "Setup": "false",
>>>   "VM Capable": true,
>>>   "CPU Type": "Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz",
>>>   "CPU Virtualization": "vmx",
>>>   "CPU Physical Cores": 4,
>>>   "UUID": "00000000-0000-0000-0000-0cc47aa33714",
>>>   "Hostname": "c10",
>>>   "CPU Total Cores": 128,
>>>   "MiB of Memory": "1572742",
>>>   "Zpool": "zones",
>>>   "Zpool Disks":
>>> "c0t5002538C401EDE5Ed0,c0t5002538C401EDE64d0,c0t5002538C401EDE65d0,c0t5002538C401EDE69d0,c0t5002538C401EE05Bd0,c0t5002538C4027D5C6d0,c0t5002538C402BD9EBd0,c0t5002538C402BD9EEd0,c0t5002538C402BD9F5d0,c0t5002538C402BD9F9d0,c0t5002538C402BD9FCd0,c0t5002538C402BDA00d0,c0t5002538C402BDAC6d0,c0t5002538C40599D5Cd0,c0t5002538C40599EDEd0,c0t5002538C40599EE8d0",
>>>   "Zpool Profile": "raidz2",
>>>   "Zpool Creation": 1477475020,
>>>   "Zpool Size in GiB": 10073,
>>>   "Disks": {
>>>     "c0t5002538C401EDE5Ed0": {"Size in GB": 960},
>>>     "c0t5002538C401EDE64d0": {"Size in GB": 960},
>>>     "c0t5002538C401EDE65d0": {"Size in GB": 960},
>>>     "c0t5002538C401EDE69d0": {"Size in GB": 960},
>>>     "c0t5002538C401EE05Bd0": {"Size in GB": 960},
>>>     "c0t5002538C4027D5C6d0": {"Size in GB": 960},
>>>     "c0t5002538C402BD9EBd0": {"Size in GB": 960},
>>>     "c0t5002538C402BD9EEd0": {"Size in GB": 960},
>>>     "c0t5002538C402BD9F5d0": {"Size in GB": 960},
>>>     "c0t5002538C402BD9F9d0": {"Size in GB": 960},
>>>     "c0t5002538C402BD9FCd0": {"Size in GB": 960},
>>>     "c0t5002538C402BDA00d0": {"Size in GB": 960},
>>>     "c0t5002538C402BDAC6d0": {"Size in GB": 960},
>>>     "c0t5002538C40599D5Cd0": {"Size in GB": 960},
>>>     "c0t5002538C40599EDEd0": {"Size in GB": 960},
>>>     "c0t5002538C40599EE8d0": {"Size in GB": 960}
>>>   },
>>>   "Boot Parameters": {
>>>     "smartos": "true",
>>>     "root_shadow": "xxxxxxxxxxx"
>>>   },
>>>   "Network Interfaces": {
>>>     "ixgbe0": {"MAC Address": "0c:c4:7a:a3:37:14", "ip4addr":
>>> "xxxxxxxxxxx", "Link Status": "up", "NIC Names": ["admin"]},
>>>     "ixgbe1": {"MAC Address": "0c:c4:7a:a3:37:15", "ip4addr": "", "Link
>>> Status": "up", "NIC Names": []}
>>>   },
>>>   "Virtual Network Interfaces": {
>>>   },
>>>   "Link Aggregations": {
>>>
>>>
>>>
>>> On initial login on c10, we see:
>>>
>>> 05:27:43 UTC 2017#012PLATFORM: SYS-4048B-TR4FT, CSN: S186693X6715658,
>>> HOSTNAME: c10#012SOURCE: fmd-self-diagnosis, REV: 1.0#012EVENT-ID:
>>> a916f701-ad5f-4c05-c134-8968a449677d#012DESC: A Solaris Fault Manager
>>> component has experienced an error that required the module to be disabled.
>>> Refer to http://illumos.org/msg/FMD-8000-2K for more
>>> information.#012AUTO-RESPONSE: The module has been disabled. Events
>>> destined for the module will be saved for manual diagnosis.#012IMPACT:
>>> Automated diagnosis and response for subsequent events associated with this
>>> module will not occur.#012REC-ACTION: Use fmdump -v -u <EVENT-ID> to locate
>>> the module.
>>> Use fmadm reset <module> to reset the module.
>>>
>>>
>>> [root@c10 /var/log]# svcs -xv svc:/site/default_init:default
>>> svc:/site/default_init:default (Copy init to /etc/default/init)
>>>  State: maintenance since May 14, 2017 04:45:56 AM UTC
>>> Reason: Start method failed repeatedly, last exited with status 2.
>>>    See: http://illumos.org/msg/SMF-8000-KS
>>>    See: /var/svc/log/site-default_init:default.log
>>> Impact: This service is not running.
>>>
>>>
>>> [root@c10 /var/log]# fmadm config
>>> MODULE                   VERSION STATUS  DESCRIPTION
>>> cpumem-retire            1.1     active  CPU/Memory Retire Agent
>>> disk-lights              1.0     active  Disk Lights Agent
>>> disk-transport           1.0     active  Disk Transport Agent
>>> eft                      1.16    active  eft diagnosis engine
>>> ext-event-transport      0.2     active  External FM event transport
>>> fabric-xlate             1.0     active  Fabric Ereport Translater
>>> fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
>>> io-retire                2.0     active  I/O Retire Agent
>>> sensor-transport         1.1     active  Sensor Transport Agent
>>> ses-log-transport        1.0     active  SES Log Transport Agent
>>> software-diagnosis       0.1     failed  Software Diagnosis engine
>>> software-response        0.1     active  Software Response Agent
>>> sysevent-transport       1.0     active  SysEvent Transport Agent
>>> syslog-msgs              1.1     active  Syslog Messaging Agent
>>> zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
>>> zfs-retire               1.0     active  ZFS Retire Agent
>>>
>>> So:
>>>
>>> software-diagnosis       0.1     failed  Software Diagnosis engine
>>>
>>>
>>> [root@c10 /var/log]# fmadm faulty
>>> --------------- ------------------------------------  -------------- ---------
>>> TIME            EVENT-ID                              MSG-ID         SEVERITY
>>> --------------- ------------------------------------  -------------- ---------
>>> May 14 05:27:43 a916f701-ad5f-4c05-c134-8968a449677d  FMD-8000-2K    Minor
>>>
>>> Host        : c10
>>> Platform    : SYS-4048B-TR4FT   Chassis_id  : S186693X6715658
>>> Product_sn  :
>>>
>>> Fault class : defect.sunos.fmd.module
>>> Affects     : fmd:///module/software-diagnosis
>>>                   faulted and taken out of service
>>> FRU         : None
>>>                   faulty
>>>
>>> Description : A Solaris Fault Manager component has experienced an error that
>>>               required the module to be disabled. Refer to
>>>               http://illumos.org/msg/FMD-8000-2K for more information.
>>>
>>> Response    : The module has been disabled. Events destined for the module
>>>               will be saved for manual diagnosis.
>>>
>>> Impact      : Automated diagnosis and response for subsequent events associated
>>>               with this module will not occur.
>>>
>>> Action      : Use fmdump -v -u <EVENT-ID> to locate the module. Use fmadm
>>>               reset <module> to reset the module.
>>>
>>>
>>> So it looks like the fmadm message is NOT related to the primary cause of
>>> the issue... event recorded AFTER reboot..
>>>
>>>
>>> Looking at /var/log/auth.log:
>>>
>>> 2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error]
>>> reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0
>>> addr=58 occurred in module "genunix" due to a NULL pointer dereference
>>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System
>>> dump time: Sun May 14 03:26:31 2017
>>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving
>>> compressed system crash dump in /var/crash/volatile/vmdump.0
>>>
>>> 2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress
>>> the crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'
>>>
>>>
>>> Nothing in dmesg, EXCEPT zsched messages - as in below:
>>>
>>> .....
>>> 2017-05-14T07:11:31.792661+00:00 c10 genunix: [ID 470503 kern.warning]
>>> WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
>>> 2017-05-14T07:11:31.792671+00:00 c10 genunix: [ID 470503 kern.warning]
>>> WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
>>> 2017-05-14T07:11:31.873184+00:00 c10 genunix: [ID 470503 kern.warning]
>>> WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
>>> 2017-05-14T07:11:31.873214+00:00 c10 genunix: [ID 470503 kern.warning]
>>> WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
>>> 2017-05-14T07:11:31.894092+00:00 c10 genunix: [ID 470503 kern.warning]
>>> WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)
>>> 2017-05-14T07:11:31.894123+00:00 c10 genunix: [ID 470503 kern.warning]
>>> WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)
>>> 2017-05-14T07:11:31.894133+00:00 c10 genunix: [ID 470503 kern.warning]
>>> WARNING: Sorry, no swap space to grow stack for pid 41140 (zsched)
>>> 2017-05-14T07:11:39.760880+00:00 c10 mac: [ID 736570 kern.info] NOTICE:
>>> vnic1272 unregistered
>>>
>>>
>>> BUT the zsched messages AND the PANIC message in auth.log both seem to be
>>> related to lwp issues that have been fixed:
>>>
>>> https://github.com/joyent/illumos-joyent/issues/127
>>> https://smartos.org/bugview/OS-4188
>>> https://smartos.org/bugview/OS-5598
>>>
>>>
>>> We have a core dump available - all 68GB (compressed) of it.
>>>
>>> But it seems to us that the issue is related to the above 'fixed'
>>> (or not?) issues? Or might this be new?
>>>
>>> Are we going down the incorrect path?
>>>
>>> Any feedback would be greatly appreciated,
>>>
>>> Thanks,
>>>
>>> Angelo.
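
For anyone triaging a similar reboot, here is a minimal sketch (not something anyone in this thread actually ran) for pulling the panic summary out of auth.log and counting the zsched swap warnings per pid. The function names and regexes are my own assumptions, modelled only on the log excerpts quoted above, and they expect each log entry unwrapped onto a single line; other platform images may format these messages differently.

```python
import re
from collections import Counter

# Assumed shape of the savecore summary line, based only on the excerpt above.
PANIC_RE = re.compile(
    r'reboot after panic: (?P<kind>.+?): type=(?P<trap>[0-9a-f]+) '
    r'\((?P<desc>[^)]*)\) rp=(?P<rp>[0-9a-f]+) addr=(?P<addr>[0-9a-f]+) '
    r'occurred in module "(?P<module>[^"]+)"'
)

# Assumed shape of the genunix warning emitted when a stack cannot grow.
ZSCHED_RE = re.compile(r'no swap space to grow stack for pid (\d+) \(zsched\)')


def parse_panic(line):
    """Return the panic fields as a dict, or None if the line does not match."""
    match = PANIC_RE.search(line)
    return match.groupdict() if match else None


def count_zsched_warnings(lines):
    """Count 'no swap space' warnings per zsched pid."""
    counts = Counter()
    for line in lines:
        match = ZSCHED_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Sample lines copied from the quoted logs (unwrapped onto single lines).
SAMPLE = [
    '2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] '
    'reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 '
    'addr=58 occurred in module "genunix" due to a NULL pointer dereference',
    '2017-05-14T07:11:31.792661+00:00 c10 genunix: [ID 470503 kern.warning] '
    'WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)',
    '2017-05-14T07:11:31.792671+00:00 c10 genunix: [ID 470503 kern.warning] '
    'WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)',
    '2017-05-14T07:11:31.873184+00:00 c10 genunix: [ID 470503 kern.warning] '
    'WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)',
]

print(parse_panic(SAMPLE[0]))
print(count_zsched_warnings(SAMPLE))
```

Note that in the logs above the zsched warnings are timestamped after the reboot, so a count like this only tells you which zones are short of swap now; it says nothing by itself about what caused the panic.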
