Good morning,
One of our newer hosts rebooted earlier this morning; before this, the host
had been up for about 5 months and functioning well.
Details of the hardware and running image are below. For its hardware
configuration the host is carrying a relatively small load - 80 guest
instances, a mix of KVM/OS/LX brands.
The host server is all-SSD, with the disks spread equally across 2 vdevs,
each in a RAIDZ2 configuration.
We believe the main issue is the panic below - in SUMMARY:
------------------------------
2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot
after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58
occurred in module "genunix" due to a NULL pointer dereference
2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System dump
time: Sun May 14 03:26:31 2017
2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving
compressed system crash dump in /var/crash/volatile/vmdump.0
2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress
the crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'
------------------------------
DETAILS ARE:
SYSINFO output:
[root@c10 ~]# sysinfo
{
"Live Image": "20161013T025521Z",
"System Type": "SunOS",
"Boot Time": "1494737104",
"SDC Version": "7.0",
"Manufacturer": "Supermicro",
"Product": "SYS-4048B-TR4FT",
"Serial Number": "S186693X6715658",
"SKU Number": "072615D9",
"HW Version": "123456789",
"HW Family": "SMC X10",
"Setup": "false",
"VM Capable": true,
"CPU Type": "Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz",
"CPU Virtualization": "vmx",
"CPU Physical Cores": 4,
"UUID": "00000000-0000-0000-0000-0cc47aa33714",
"Hostname": "c10",
"CPU Total Cores": 128,
"MiB of Memory": "1572742",
"Zpool": "zones",
"Zpool Disks": "c0t5002538C401EDE5Ed0,c0t5002538C401EDE64d0,c0t5002538C401EDE65d0,c0t5002538C401EDE69d0,c0t5002538C401EE05Bd0,c0t5002538C4027D5C6d0,c0t5002538C402BD9EBd0,c0t5002538C402BD9EEd0,c0t5002538C402BD9F5d0,c0t5002538C402BD9F9d0,c0t5002538C402BD9FCd0,c0t5002538C402BDA00d0,c0t5002538C402BDAC6d0,c0t5002538C40599D5Cd0,c0t5002538C40599EDEd0,c0t5002538C40599EE8d0",
"Zpool Profile": "raidz2",
"Zpool Creation": 1477475020,
"Zpool Size in GiB": 10073,
"Disks": {
"c0t5002538C401EDE5Ed0": {"Size in GB": 960},
"c0t5002538C401EDE64d0": {"Size in GB": 960},
"c0t5002538C401EDE65d0": {"Size in GB": 960},
"c0t5002538C401EDE69d0": {"Size in GB": 960},
"c0t5002538C401EE05Bd0": {"Size in GB": 960},
"c0t5002538C4027D5C6d0": {"Size in GB": 960},
"c0t5002538C402BD9EBd0": {"Size in GB": 960},
"c0t5002538C402BD9EEd0": {"Size in GB": 960},
"c0t5002538C402BD9F5d0": {"Size in GB": 960},
"c0t5002538C402BD9F9d0": {"Size in GB": 960},
"c0t5002538C402BD9FCd0": {"Size in GB": 960},
"c0t5002538C402BDA00d0": {"Size in GB": 960},
"c0t5002538C402BDAC6d0": {"Size in GB": 960},
"c0t5002538C40599D5Cd0": {"Size in GB": 960},
"c0t5002538C40599EDEd0": {"Size in GB": 960},
"c0t5002538C40599EE8d0": {"Size in GB": 960}
},
"Boot Parameters": {
"smartos": "true",
"root_shadow": "xxxxxxxxxxx"
},
"Network Interfaces": {
"ixgbe0": {"MAC Address": "0c:c4:7a:a3:37:14", "ip4addr": "xxxxxxxxxxx",
"Link Status": "up", "NIC Names": ["admin"]},
"ixgbe1": {"MAC Address": "0c:c4:7a:a3:37:15", "ip4addr": "", "Link
Status": "up", "NIC Names": []}
},
"Virtual Network Interfaces": {
},
"Link Aggregations": {
}
}
On initial login on c10, we see:
05:27:43 UTC 2017
PLATFORM: SYS-4048B-TR4FT, CSN: S186693X6715658, HOSTNAME: c10
SOURCE: fmd-self-diagnosis, REV: 1.0
EVENT-ID: a916f701-ad5f-4c05-c134-8968a449677d
DESC: A Solaris Fault Manager component has experienced an error that
required the module to be disabled. Refer to
http://illumos.org/msg/FMD-8000-2K for more information.
AUTO-RESPONSE: The module has been disabled. Events destined for the module
will be saved for manual diagnosis.
IMPACT: Automated diagnosis and response for subsequent events associated
with this module will not occur.
REC-ACTION: Use fmdump -v -u <EVENT-ID> to locate the module. Use fmadm
reset <module> to reset the module.
[root@c10 /var/log]# svcs -xv svc:/site/default_init:default
svc:/site/default_init:default (Copy init to /etc/default/init)
State: maintenance since May 14, 2017 04:45:56 AM UTC
Reason: Start method failed repeatedly, last exited with status 2.
See: http://illumos.org/msg/SMF-8000-KS
See: /var/svc/log/site-default_init:default.log
Impact: This service is not running.
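(Once we understand why the start method exited with status 2, we'd expect
to be able to clear the maintenance state the usual way - a sketch of the
steps, nothing exotic:)
tail /var/svc/log/site-default_init:default.log   # see why it exited 2
svcadm clear svc:/site/default_init:default
svcs -xv svc:/site/default_init:default            # confirm it comes back online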
[root@c10 /var/log]# fmadm config
MODULE                 VERSION STATUS  DESCRIPTION
cpumem-retire          1.1     active  CPU/Memory Retire Agent
disk-lights            1.0     active  Disk Lights Agent
disk-transport         1.0     active  Disk Transport Agent
eft                    1.16    active  eft diagnosis engine
ext-event-transport    0.2     active  External FM event transport
fabric-xlate           1.0     active  Fabric Ereport Translater
fmd-self-diagnosis     1.0     active  Fault Manager Self-Diagnosis
io-retire              2.0     active  I/O Retire Agent
sensor-transport       1.1     active  Sensor Transport Agent
ses-log-transport      1.0     active  SES Log Transport Agent
software-diagnosis     0.1     failed  Software Diagnosis engine
software-response      0.1     active  Software Response Agent
sysevent-transport     1.0     active  SysEvent Transport Agent
syslog-msgs            1.1     active  Syslog Messaging Agent
zfs-diagnosis          1.0     active  ZFS Diagnosis Engine
zfs-retire             1.0     active  ZFS Retire Agent
So:
software-diagnosis 0.1 failed Software Diagnosis engine
[root@c10 /var/log]# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
May 14 05:27:43 a916f701-ad5f-4c05-c134-8968a449677d FMD-8000-2K    Minor
Host : c10
Platform : SYS-4048B-TR4FT Chassis_id : S186693X6715658
Product_sn :
Fault class : defect.sunos.fmd.module
Affects : fmd:///module/software-diagnosis
faulted and taken out of service
FRU : None
faulty
Description : A Solaris Fault Manager component has experienced an error that
              required the module to be disabled. Refer to
              http://illumos.org/msg/FMD-8000-2K for more information.
Response    : The module has been disabled. Events destined for the module
              will be saved for manual diagnosis.
Impact      : Automated diagnosis and response for subsequent events
              associated with this module will not occur.
Action      : Use fmdump -v -u <EVENT-ID> to locate the module. Use fmadm
              reset <module> to reset the module.
So it looks like the fmadm message is NOT related to the primary cause of
the issue - the event was recorded AFTER the reboot.
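(If that's right, once we're done investigating we'd expect to clear it per
the REC-ACTION above - roughly:)
fmdump -v -u a916f701-ad5f-4c05-c134-8968a449677d   # confirm the module
fmadm reset software-diagnosis
fmadm faulty                                         # should now be empty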
Looking at /var/log/auth.log:
2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot
after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58
occurred in module "genunix" due to a NULL pointer dereference
2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System dump
time: Sun May 14 03:26:31 2017
2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving
compressed system crash dump in /var/crash/volatile/vmdump.0
2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress
the crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'
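For reference, our plan for digging into the dump itself - a sketch using
the standard mdb dcmds, which we haven't finished working through yet:
savecore -vf /var/crash/volatile/vmdump.0   # extract unix.0 / vmcore.0
cd /var/crash/volatile
mdb unix.0 vmcore.0
> ::status       # panic string and dump details
> ::panicinfo    # trap type and faulting address (addr=58)
> ::stack        # panic thread's stack trace
> ::msgbuf       # kernel messages leading up to the panic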
Nothing notable in dmesg EXCEPT zsched messages, as below:
.....
2017-05-14T07:11:31.792661+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
2017-05-14T07:11:31.792671+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
2017-05-14T07:11:31.873184+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
2017-05-14T07:11:31.873214+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
2017-05-14T07:11:31.894092+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)
2017-05-14T07:11:31.894123+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)
2017-05-14T07:11:31.894133+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41140 (zsched)
2017-05-14T07:11:39.760880+00:00 c10 mac: [ID 736570 kern.info] NOTICE:
vnic1272 unregistered
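Since those warnings point at swap reservation being exhausted, we've been
checking host memory and swap along these lines (a sketch; zonememstat is
the SmartOS per-zone view):
swap -sh                  # swap reservation/allocation summary
swap -lh                  # swap devices and free space
echo ::memstat | mdb -k   # kernel vs. zone memory breakdown
zonememstat               # per-zone caps and usage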
BUT the zsched messages AND the PANIC message in auth.log both seem to be
related to lwp issues that have been fixed:
https://github.com/joyent/illumos-joyent/issues/127
https://smartos.org/bugview/OS-4188
https://smartos.org/bugview/OS-5598
We have a core dump available - all 68GB (compressed) of it..
But it seems to us that the issue is related to the above 'fixed' (or not?)
issues. Or might this be something new?
Are we going down the wrong path?
Any feedback would be greatly appreciated,
Thanks,
Angelo.
-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now