No problem! Hopefully that'll fix it.

Yes, feel free :)

> On 15 May 2017, at 06:54, Dr. Angelo Roussos <[email protected]> wrote:
> 
> Thanks very much Adam.
> 
> We'll schedule an upgrade based on your recommendations.
> 
> If you don't mind, I'll email you off-list about something related to this 
> but outside the scope of the SmartOS list.
> 
>> On 14 May 2017, at 12:43, Adam Richmond-Gordon <[email protected]> 
>> wrote:
>> 
>> Hello,
>> 
>> We had similar crashes a little while back, also on an X10 board with a v4 
>> chip.
>> 
>> Some v4 chips have a microcode bug which can be rectified by installing the 
>> latest BIOS/firmware from Supermicro - if you’re running the X10QBi board 
>> (which it seems you are), there was a new BIOS/firmware available in 
>> October. KVM triggers this microcode bug.
>> 
>> I initially raised the LX bug OS-5598 on IRC (and then GitHub); we haven't 
>> suffered issues with it since updating to a platform image that is actually 
>> older than the one you are currently running (though updating your platform 
>> image might still be a good idea).
>> 
>> It might be worth checking that you are running the latest BIOS/firmware, 
>> just to be sure you’re not hitting the v4 microcode bug before you spend too 
>> much time trying to track down what might be a red herring.
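If it helps: the installed BIOS version and release date can be read from the SMBIOS tables on the host with `smbios -t SMB_TYPE_BIOS`, and the running microcode revision with `ucodeadm -v`. A minimal sketch of pulling the release date out for comparison, using hypothetical smbios output (the real output comes from running the command on the host):

```shell
# Hypothetical smbios(8) BIOS-section output for illustration; on the SmartOS
# host, run `smbios -t SMB_TYPE_BIOS` (and `ucodeadm -v` for the microcode
# revision) to get the real values.
smbios_out='  Vendor: American Megatrends Inc.
  Version String: 1.0b
  Release Date: 10/21/2016'

# Extract the BIOS release date to compare against Supermicro's October update:
echo "$smbios_out" | awk -F': ' '/Release Date/ {print $2}'
```

If the reported release date predates Supermicro's October BIOS/firmware for the X10QBi, the microcode fix is likely not installed.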
>> 
>> That said, I’m sure somebody at Joyent would still like to see the crash 
>> dump.
>> 
>> Adam
>> 
>>> On 14 May 2017, at 10:20, Dr. Angelo Roussos <[email protected]> wrote:
>>> 
>>>  
>>> Good morning,
>>>  
>>> One of our newer hosts rebooted earlier this morning - before this, the 
>>> host had been up for about 5 months and functioning well.
>>>  
>>> Details as to hardware and running image are below, but the host is 
>>> currently hosting a relatively small number of guest instances for its 
>>> hardware config - 80 guests, a mix of KVM/OS/LX guests.
>>>  
>>> The host server is installed with all SSDs, in a RAIDZ2 configuration, 
>>> spread equally across 2 vdevs.
>>>  
>>> We believe the main issue may be related to this - in SUMMARY:
>>>  
>>> ------------------------------
>>>  
>>> 2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot 
>>> after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58 
>>> occurred in module "genunix" due to a NULL pointer dereference
>>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System 
>>> dump time: Sun May 14 03:26:31 2017
>>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving 
>>> compressed system crash dump in /var/crash/volatile/vmdump.0
>>>  
>>> 2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress 
>>> the crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'
>>>  
>>> ------------------------------
>>>  
>>>  
>>> DETAILS ARE:
>>>  
>>> SYSINFO output:
>>>  
>>> [root@c10 ~]# sysinfo
>>> {
>>>   "Live Image": "20161013T025521Z",
>>>   "System Type": "SunOS",
>>>   "Boot Time": "1494737104",
>>>   "SDC Version": "7.0",
>>>   "Manufacturer": "Supermicro",
>>>   "Product": "SYS-4048B-TR4FT",
>>>   "Serial Number": "S186693X6715658",
>>>   "SKU Number": "072615D9",
>>>   "HW Version": "123456789",
>>>   "HW Family": "SMC X10",
>>>   "Setup": "false",
>>>   "VM Capable": true,
>>>   "CPU Type": "Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz",
>>>   "CPU Virtualization": "vmx",
>>>   "CPU Physical Cores": 4,
>>>   "UUID": "00000000-0000-0000-0000-0cc47aa33714",
>>>   "Hostname": "c10",
>>>   "CPU Total Cores": 128,
>>>   "MiB of Memory": "1572742",
>>>   "Zpool": "zones",
>>>   "Zpool Disks": 
>>> "c0t5002538C401EDE5Ed0,c0t5002538C401EDE64d0,c0t5002538C401EDE65d0,c0t5002538C401EDE69d0,c0t5002538C401EE05Bd0,c0t5002538C4027D5C6d0,c0t5002538C402BD9EBd0,c0t5002538C402BD9EEd0,c0t5002538C402BD9F5d0,c0t5002538C402BD9F9d0,c0t5002538C402BD9FCd0,c0t5002538C402BDA00d0,c0t5002538C402BDAC6d0,c0t5002538C40599D5Cd0,c0t5002538C40599EDEd0,c0t5002538C40599EE8d0",
>>>   "Zpool Profile": "raidz2",
>>>   "Zpool Creation": 1477475020,
>>>   "Zpool Size in GiB": 10073,
>>>   "Disks": {
>>>     "c0t5002538C401EDE5Ed0": {"Size in GB": 960},
>>>     "c0t5002538C401EDE64d0": {"Size in GB": 960},
>>>     "c0t5002538C401EDE65d0": {"Size in GB": 960},
>>>     "c0t5002538C401EDE69d0": {"Size in GB": 960},
>>>     "c0t5002538C401EE05Bd0": {"Size in GB": 960},
>>>     "c0t5002538C4027D5C6d0": {"Size in GB": 960},
>>>     "c0t5002538C402BD9EBd0": {"Size in GB": 960},
>>>     "c0t5002538C402BD9EEd0": {"Size in GB": 960},
>>>     "c0t5002538C402BD9F5d0": {"Size in GB": 960},
>>>     "c0t5002538C402BD9F9d0": {"Size in GB": 960},
>>>     "c0t5002538C402BD9FCd0": {"Size in GB": 960},
>>>     "c0t5002538C402BDA00d0": {"Size in GB": 960},
>>>     "c0t5002538C402BDAC6d0": {"Size in GB": 960},
>>>     "c0t5002538C40599D5Cd0": {"Size in GB": 960},
>>>     "c0t5002538C40599EDEd0": {"Size in GB": 960},
>>>     "c0t5002538C40599EE8d0": {"Size in GB": 960}
>>>   },
>>>   "Boot Parameters": {
>>>     "smartos": "true",
>>>     "root_shadow": "xxxxxxxxxxx"
>>>   },
>>>   "Network Interfaces": {
>>>     "ixgbe0": {"MAC Address": "0c:c4:7a:a3:37:14", "ip4addr": 
>>> "xxxxxxxxxxx", "Link Status": "up", "NIC Names": ["admin"]},
>>>     "ixgbe1": {"MAC Address": "0c:c4:7a:a3:37:15", "ip4addr": "", "Link 
>>> Status": "up", "NIC Names": []}
>>>   },
>>>   "Virtual Network Interfaces": {
>>>   },
>>>   "Link Aggregations": {
>>>   }
>>> }
>>>  
>>>  
>>> On initial login on c10, we see:
>>>  
>>> 05:27:43 UTC 2017
>>> PLATFORM: SYS-4048B-TR4FT, CSN: S186693X6715658, HOSTNAME: c10
>>> SOURCE: fmd-self-diagnosis, REV: 1.0
>>> EVENT-ID: a916f701-ad5f-4c05-c134-8968a449677d
>>> DESC: A Solaris Fault Manager component has experienced an error that 
>>> required the module to be disabled.  Refer to 
>>> http://illumos.org/msg/FMD-8000-2K for more information.
>>> AUTO-RESPONSE: The module has been disabled.  Events destined for the 
>>> module will be saved for manual diagnosis.
>>> IMPACT: Automated diagnosis and response for subsequent events associated 
>>> with this module will not occur.
>>> REC-ACTION: Use fmdump -v -u <EVENT-ID> to locate the module.  Use fmadm 
>>> reset <module> to reset the module.
>>>  
>>>  
>>> [root@c10 /var/log]# svcs -xv svc:/site/default_init:default
>>> svc:/site/default_init:default (Copy init to /etc/default/init)
>>> State: maintenance since May 14, 2017 04:45:56 AM UTC
>>> Reason: Start method failed repeatedly, last exited with status 2.
>>>    See: http://illumos.org/msg/SMF-8000-KS
>>>    See: /var/svc/log/site-default_init:default.log
>>> Impact: This service is not running.
>>>  
>>>  
>>> [root@c10 /var/log]# fmadm config
>>> MODULE                   VERSION STATUS  DESCRIPTION
>>> cpumem-retire            1.1     active  CPU/Memory Retire Agent
>>> disk-lights              1.0     active  Disk Lights Agent
>>> disk-transport           1.0     active  Disk Transport Agent
>>> eft                      1.16    active  eft diagnosis engine
>>> ext-event-transport      0.2     active  External FM event transport
>>> fabric-xlate             1.0     active  Fabric Ereport Translater
>>> fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
>>> io-retire                2.0     active  I/O Retire Agent
>>> sensor-transport         1.1     active  Sensor Transport Agent
>>> ses-log-transport        1.0     active  SES Log Transport Agent
>>> software-diagnosis       0.1     failed  Software Diagnosis engine
>>> software-response        0.1     active  Software Response Agent
>>> sysevent-transport       1.0     active  SysEvent Transport Agent
>>> syslog-msgs              1.1     active  Syslog Messaging Agent
>>> zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
>>> zfs-retire               1.0     active  ZFS Retire Agent
>>>  
>>> So:
>>>  
>>> software-diagnosis       0.1     failed  Software Diagnosis engine
>>>  
>>>  
>>> [root@c10 /var/log]# fmadm faulty
>>> --------------- ------------------------------------  -------------- 
>>> ---------
>>> TIME            EVENT-ID                              MSG-ID         
>>> SEVERITY
>>> --------------- ------------------------------------  -------------- 
>>> ---------
>>> May 14 05:27:43 a916f701-ad5f-4c05-c134-8968a449677d  FMD-8000-2K    Minor
>>>  
>>> Host        : c10
>>> Platform    : SYS-4048B-TR4FT   Chassis_id  : S186693X6715658
>>> Product_sn  :
>>>  
>>> Fault class : defect.sunos.fmd.module
>>> Affects     : fmd:///module/software-diagnosis
>>>                   faulted and taken out of service
>>> FRU         : None
>>>                   faulty
>>>  
>>> Description : A Solaris Fault Manager component has experienced an error 
>>> that
>>>               required the module to be disabled.  Refer to
>>>               http://illumos.org/msg/FMD-8000-2K for more information.
>>>  
>>> Response    : The module has been disabled.  Events destined for the module
>>>               will be saved for manual diagnosis.
>>>  
>>> Impact      : Automated diagnosis and response for subsequent events 
>>> associated
>>>               with this module will not occur.
>>>  
>>> Action      : Use fmdump -v -u <EVENT-ID> to locate the module.  Use fmadm
>>>               reset <module> to reset the module.
>>>  
>>>  
>>> So it looks like the fmadm message is NOT related to the primary cause of 
>>> the issue - the event was recorded AFTER the reboot.
>>>  
>>>  
>>> Looking at /var/log/auth.log:
>>>  
>>> 2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] 
>>> reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 
>>> addr=58 occurred in module "genunix" due to a NULL pointer dereference
>>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System 
>>> dump time: Sun May 14 03:26:31 2017
>>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving 
>>> compressed system crash dump in /var/crash/volatile/vmdump.0
>>>  
>>> 2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress 
>>> the crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'
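As an aside, the `#012` sequences in these syslog lines are rsyslog's octal escape for embedded newlines (octal 012), so each `#012` marks where a line break was in the original message. They can be unescaped for readability (a sketch, assuming GNU sed):

```shell
# '#012' is rsyslog's escaped newline (octal 012); restore it for readability.
line="Decompress the crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'"
printf '%s\n' "$line" | sed 's/#012/\n/g'
```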
>>>  
>>>  
>>> Nothing in dmesg, EXCEPT zsched messages, as below:
>>>  
>>> .....
>>> 2017-05-14T07:11:31.792661+00:00 c10 genunix: [ID 470503 kern.warning] 
>>> WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
>>> 2017-05-14T07:11:31.792671+00:00 c10 genunix: [ID 470503 kern.warning] 
>>> WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
>>> 2017-05-14T07:11:31.873184+00:00 c10 genunix: [ID 470503 kern.warning] 
>>> WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
>>> 2017-05-14T07:11:31.873214+00:00 c10 genunix: [ID 470503 kern.warning] 
>>> WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
>>> 2017-05-14T07:11:31.894092+00:00 c10 genunix: [ID 470503 kern.warning] 
>>> WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)
>>> 2017-05-14T07:11:31.894123+00:00 c10 genunix: [ID 470503 kern.warning] 
>>> WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)
>>> 2017-05-14T07:11:31.894133+00:00 c10 genunix: [ID 470503 kern.warning] 
>>> WARNING: Sorry, no swap space to grow stack for pid 41140 (zsched)
>>> 2017-05-14T07:11:39.760880+00:00 c10 mac: [ID 736570 kern.info] NOTICE: 
>>> vnic1272 unregistered
>>>  
>>>  
>>> BUT the zsched messages AND the PANIC message in auth.log both seem to be 
>>> related to lwp issues that have already been fixed:
>>>  
>>> https://github.com/joyent/illumos-joyent/issues/127
>>> https://smartos.org/bugview/OS-4188
>>> https://smartos.org/bugview/OS-5598
>>>  
>>>  
>>> We have a core dump available - all 68GB (compressed) of it.
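For comparing against those tickets without shipping the whole dump: after `savecore -vf` expands it, running `echo '::status' | mdb unix.0 vmcore.0` on the host prints the panic message, and `::stack` the panic stack. A sketch of isolating just the panic string from that output (the `::status` text below is hypothetical, reconstructed from the auth.log lines above):

```shell
# Hypothetical mdb ::status output for illustration; on the host, run
#   echo '::status' | mdb unix.0 vmcore.0
# against the expanded crash dump to get the real text.
status_out='debugging crash dump vmcore.0 (64-bit) from c10
operating system: 5.11 joyent_20161013T025521Z (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) occurred in module "genunix" due to a NULL pointer dereference'

# Pull out just the panic message line for pasting into a bug report:
echo "$status_out" | sed -n 's/^panic message: //p'
```

The panic message plus the `::stack` output is usually enough for a first pass at matching a known ticket.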
>>>  
>>> But it seems to us that the issue is related to the above 'fixed' (or 
>>> not?) issues. Or might this be new?
>>>  
>>> Are we going down the incorrect path?
>>>  
>>> Any feedback would be greatly appreciated,
>>>  
>>> Thanks,
>>>  
>>> Angelo.
>>>  
>>>  
>>>  
>> 
> 



-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=25769125&id_secret=25769125-7688e9fb
Powered by Listbox: http://www.listbox.com
