Good morning,

 

One of our newer hosts rebooted earlier this morning. Before this, the host
had been up for about 5 months and functioning well.

 

Details of the hardware and running image are below. The host is currently
running a relatively small number of guest instances for its hardware
configuration - 80 guests, a mix of KVM/OS/LX guests.

 

The host server is installed with all SSDs in a RAIDZ2 configuration,
spread equally across two vdevs.

 

We believe the main issue may be related to the following - in SUMMARY:

 

------------------------------

 

2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot
after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58
occurred in module "genunix" due to a NULL pointer dereference

2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System dump
time: Sun May 14 03:26:31 2017

2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving
compressed system crash dump in /var/crash/volatile/vmdump.0

 

2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress
the crash dump with
'savecore -vf /var/crash/volatile/vmdump.0'

 

------------------------------
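As a quick cross-check of that panic line, the faulting module and address can be pulled out with a short sed sketch (illustrative only - the sample line is embedded here; on the host you would grep /var/log/auth.log instead):

```shell
# Illustrative only: extract the faulting module and address from the
# savecore panic line (sample embedded; grep /var/log/auth.log on the host).
log_line='2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58 occurred in module "genunix" due to a NULL pointer dereference'

module=$(printf '%s\n' "$log_line" | sed -n 's/.*occurred in module "\([^"]*\)".*/\1/p')
addr=$(printf '%s\n' "$log_line" | sed -n 's/.*addr=\([0-9a-fx]*\) .*/\1/p')
echo "module=$module addr=$addr"   # -> module=genunix addr=58
```

From there, `savecore -vf /var/crash/volatile/vmdump.0` unpacks the dump, and it can typically be inspected with `mdb unix.0 vmcore.0` using the `::panicinfo` and `::stack` dcmds to get the panic stack.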

 

 

DETAILS ARE:

 

SYSINFO output:

 

[root@c10 ~]# sysinfo

{

  "Live Image": "20161013T025521Z",

  "System Type": "SunOS",

  "Boot Time": "1494737104",

  "SDC Version": "7.0",

  "Manufacturer": "Supermicro",

  "Product": "SYS-4048B-TR4FT",

  "Serial Number": "S186693X6715658",

  "SKU Number": "072615D9",

  "HW Version": "123456789",

  "HW Family": "SMC X10",

  "Setup": "false",

  "VM Capable": true,

  "CPU Type": "Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz",

  "CPU Virtualization": "vmx",

  "CPU Physical Cores": 4,

  "UUID": "00000000-0000-0000-0000-0cc47aa33714",

  "Hostname": "c10",

  "CPU Total Cores": 128,

  "MiB of Memory": "1572742",

  "Zpool": "zones",

  "Zpool Disks": "c0t5002538C401EDE5Ed0,c0t5002538C401EDE64d0,c0t5002538C401EDE65d0,c0t5002538C401EDE69d0,c0t5002538C401EE05Bd0,c0t5002538C4027D5C6d0,c0t5002538C402BD9EBd0,c0t5002538C402BD9EEd0,c0t5002538C402BD9F5d0,c0t5002538C402BD9F9d0,c0t5002538C402BD9FCd0,c0t5002538C402BDA00d0,c0t5002538C402BDAC6d0,c0t5002538C40599D5Cd0,c0t5002538C40599EDEd0,c0t5002538C40599EE8d0",

  "Zpool Profile": "raidz2",

  "Zpool Creation": 1477475020,

  "Zpool Size in GiB": 10073,

  "Disks": {

    "c0t5002538C401EDE5Ed0": {"Size in GB": 960},

    "c0t5002538C401EDE64d0": {"Size in GB": 960},

    "c0t5002538C401EDE65d0": {"Size in GB": 960},

    "c0t5002538C401EDE69d0": {"Size in GB": 960},

    "c0t5002538C401EE05Bd0": {"Size in GB": 960},

    "c0t5002538C4027D5C6d0": {"Size in GB": 960},

    "c0t5002538C402BD9EBd0": {"Size in GB": 960},

    "c0t5002538C402BD9EEd0": {"Size in GB": 960},

    "c0t5002538C402BD9F5d0": {"Size in GB": 960},

    "c0t5002538C402BD9F9d0": {"Size in GB": 960},

    "c0t5002538C402BD9FCd0": {"Size in GB": 960},

    "c0t5002538C402BDA00d0": {"Size in GB": 960},

    "c0t5002538C402BDAC6d0": {"Size in GB": 960},

    "c0t5002538C40599D5Cd0": {"Size in GB": 960},

    "c0t5002538C40599EDEd0": {"Size in GB": 960},

    "c0t5002538C40599EE8d0": {"Size in GB": 960}

  },

  "Boot Parameters": {

    "smartos": "true",

    "root_shadow": "xxxxxxxxxxx"

  },

  "Network Interfaces": {

    "ixgbe0": {"MAC Address": "0c:c4:7a:a3:37:14", "ip4addr": "xxxxxxxxxxx", "Link Status": "up", "NIC Names": ["admin"]},

    "ixgbe1": {"MAC Address": "0c:c4:7a:a3:37:15", "ip4addr": "", "Link Status": "up", "NIC Names": []}

  },

  "Virtual Network Interfaces": {

  },

  "Link Aggregations": {

 

 

 

On initial login on c10, we see:

 

05:27:43 UTC 2017
PLATFORM: SYS-4048B-TR4FT, CSN: S186693X6715658, HOSTNAME: c10
SOURCE: fmd-self-diagnosis, REV: 1.0
EVENT-ID: a916f701-ad5f-4c05-c134-8968a449677d
DESC: A Solaris Fault Manager component has experienced an error that required
the module to be disabled. Refer to http://illumos.org/msg/FMD-8000-2K for
more information.
AUTO-RESPONSE: The module has been disabled.  Events destined for the module
will be saved for manual diagnosis.
IMPACT: Automated diagnosis and response for subsequent events associated with
this module will not occur.
REC-ACTION: Use fmdump -v -u <EVENT-ID> to locate the module.  Use fmadm reset
<module> to reset the module.

 

 

[root@c10 /var/log]# svcs -xv svc:/site/default_init:default

svc:/site/default_init:default (Copy init to /etc/default/init)

State: maintenance since May 14, 2017 04:45:56 AM UTC

Reason: Start method failed repeatedly, last exited with status 2.

   See: http://illumos.org/msg/SMF-8000-KS

   See: /var/svc/log/site-default_init:default.log

Impact: This service is not running.

 

 

[root@c10 /var/log]# fmadm config

MODULE                   VERSION STATUS  DESCRIPTION

cpumem-retire            1.1     active  CPU/Memory Retire Agent

disk-lights              1.0     active  Disk Lights Agent

disk-transport           1.0     active  Disk Transport Agent

eft                      1.16    active  eft diagnosis engine

ext-event-transport      0.2     active  External FM event transport

fabric-xlate             1.0     active  Fabric Ereport Translater

fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis

io-retire                2.0     active  I/O Retire Agent

sensor-transport         1.1     active  Sensor Transport Agent

ses-log-transport        1.0     active  SES Log Transport Agent

software-diagnosis       0.1     failed  Software Diagnosis engine

software-response        0.1     active  Software Response Agent

sysevent-transport       1.0     active  SysEvent Transport Agent

syslog-msgs              1.1     active  Syslog Messaging Agent

zfs-diagnosis            1.0     active  ZFS Diagnosis Engine

zfs-retire               1.0     active  ZFS Retire Agent

 

So:

 

software-diagnosis       0.1     failed  Software Diagnosis engine

 

 

[root@c10 /var/log]# fmadm faulty

--------------- ------------------------------------  --------------  ---------
TIME            EVENT-ID                              MSG-ID          SEVERITY
--------------- ------------------------------------  --------------  ---------

May 14 05:27:43 a916f701-ad5f-4c05-c134-8968a449677d  FMD-8000-2K    Minor

 

Host        : c10

Platform    : SYS-4048B-TR4FT   Chassis_id  : S186693X6715658

Product_sn  :

 

Fault class : defect.sunos.fmd.module

Affects     : fmd:///module/software-diagnosis

                  faulted and taken out of service

FRU         : None

                  faulty

 

Description : A Solaris Fault Manager component has experienced an error
that

              required the module to be disabled.  Refer to

              http://illumos.org/msg/FMD-8000-2K for more information.

 

Response    : The module has been disabled.  Events destined for the module

              will be saved for manual diagnosis.

 

Impact      : Automated diagnosis and response for subsequent events
associated

              with this module will not occur.

 

Action      : Use fmdump -v -u <EVENT-ID> to locate the module.  Use fmadm

              reset <module> to reset the module.

 

 

So it looks like the fmadm message is NOT related to the primary cause of
the issue - the event was recorded AFTER the reboot.
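A quick bit of timestamp arithmetic backs this up - the FMD event was logged well after the savecore reboot record (GNU date syntax shown for illustration; illumos date(1) differs):

```shell
# GNU date syntax (illustrative; illumos date(1) differs) - compare the
# savecore reboot record (04:45:56) with the FMD event time (05:27:43).
panic_ts=$(date -u -d '2017-05-14 04:45:56' +%s)
fault_ts=$(date -u -d '2017-05-14 05:27:43' +%s)
echo "FMD event logged $(( fault_ts - panic_ts ))s after the reboot record"
```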

 

 

Looking at /var/log/auth.log:

 

2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot
after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58
occurred in module "genunix" due to a NULL pointer dereference

2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System dump
time: Sun May 14 03:26:31 2017

2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving
compressed system crash dump in /var/crash/volatile/vmdump.0

 

2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress
the crash dump with
'savecore -vf /var/crash/volatile/vmdump.0'

 

 

Nothing in dmesg EXCEPT zsched messages, as below:

 

.....

2017-05-14T07:11:31.792661+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)

2017-05-14T07:11:31.792671+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)

2017-05-14T07:11:31.873184+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)

2017-05-14T07:11:31.873214+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)

2017-05-14T07:11:31.894092+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)

2017-05-14T07:11:31.894123+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)

2017-05-14T07:11:31.894133+00:00 c10 genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 41140 (zsched)

2017-05-14T07:11:39.760880+00:00 c10 mac: [ID 736570 kern.info] NOTICE:
vnic1272 unregistered
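To gauge how widespread the swap pressure was, the warnings can be tallied per pid with a small awk sketch (illustrative - sample lines are embedded here; on the host you would point awk at /var/adm/messages or the dmesg output instead):

```shell
# Illustrative: count "no swap space to grow stack" warnings per pid
# (sample lines embedded; use /var/adm/messages on the host).
cat <<'EOF' > /tmp/zsched_sample.log
2017-05-14T07:11:31.792661+00:00 c10 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
2017-05-14T07:11:31.792671+00:00 c10 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
2017-05-14T07:11:31.873184+00:00 c10 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
EOF
counts=$(awk '/no swap space to grow stack/ { print $(NF-1) }' /tmp/zsched_sample.log | sort | uniq -c)
echo "$counts"   # per-pid counts: "2 41133" and "1 41137" for this sample
```

On the host, `swap -l` and `swap -s` would show whether reserved swap was actually exhausted at the time.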

 

 

BUT the zsched messages AND the PANIC message in auth.log both seem to be
related to lwp issues that have already been fixed:

 

https://github.com/joyent/illumos-joyent/issues/127

https://smartos.org/bugview/OS-4188

https://smartos.org/bugview/OS-5598
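Since platform images are named by build timestamp, a simple comparison of date stamps shows whether the running image could even contain a given fix (the candidate stamp below is a placeholder for illustration, not the real build date of any of these fixes):

```shell
# Placeholder comparison: the "Live Image" stamp's date prefix compares
# numerically, so this tells us whether the running platform predates a
# candidate build (candidate date is hypothetical, not a real fix date).
live_image='20161013T025521Z'   # "Live Image" from sysinfo above
candidate='20170202T000000Z'    # hypothetical platform build containing a fix
if [ "${live_image%%T*}" -lt "${candidate%%T*}" ]; then
  echo "running image predates $candidate - the fix may not be present"
fi
```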

 

 

We have a crash dump available - all 68GB (compressed) of it.

 

But it seems to us that the issue is related to the above 'fixed' (or
not?) issues. Or might this be something new?

 

Are we going down the incorrect path?

 

Any feedback would be greatly appreciated,

 

Thanks,

 

Angelo.

 

 

 




-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now