Thanks very much, Adam.

We'll schedule an upgrade based on your recommendations.

If you don't mind, I'll email you off-list about something related to this 
but outside the scope of the SmartOS list.
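
For anyone on the list working out how far behind a host is before scheduling 
an upgrade, here is a minimal sketch in Python. It is not part of any SmartOS 
tooling, and the names parse_stamp and image_age_days are made up for 
illustration. It assumes the "Live Image" stamp from sysinfo is a UTC 
timestamp of the form YYYYMMDDTHHMMSSZ, and that the savecore "System dump 
time" is also UTC (the log does not state this explicitly).

```python
from datetime import datetime, timezone

# Assumed layout of a platform image stamp such as "20161013T025521Z".
STAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def parse_stamp(stamp: str) -> datetime:
    """Parse a platform image stamp into a timezone-aware UTC datetime."""
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


def image_age_days(stamp: str, at: datetime) -> int:
    """Return the whole number of days between the image stamp and `at`."""
    return (at - parse_stamp(stamp)).days


if __name__ == "__main__":
    # "Live Image" value from the sysinfo output quoted in this thread.
    live_image = "20161013T025521Z"
    # "System dump time: Sun May 14 03:26:31 2017", assumed to be UTC.
    dump_time = datetime(2017, 5, 14, 3, 26, 31, tzinfo=timezone.utc)
    print("image age at panic (days):", image_age_days(live_image, dump_time))
```

The same helper can take datetime.now(timezone.utc) as the second argument to 
show how old the running image is today.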

> On 14 May 2017, at 12:43, Adam Richmond-Gordon <[email protected]> 
> wrote:
> 
> Hello,
> 
> We had similar crashes a little while back, also on an X10 board with a v4 
> chip.
> 
> Some v4 chips have a microcode bug which can be rectified by installing the 
> latest BIOS/firmware from Supermicro - if you’re running the X10QBi board 
> (which it seems you are), there was a new BIOS/firmware available in October. 
> KVM triggers this microcode bug.
> 
> I initially raised the LX bug OS-5598 on IRC (then GitHub); we didn’t suffer 
> issues with that after updating to a platform image that is older than the 
> one you are currently running (though updating your platform image might 
> still be an idea).
> 
> It might be worth checking that you are running the latest BIOS/firmware, 
> just to be sure you’re not hitting the v4 microcode bug before you spend too 
> much time trying to track down what might be a red herring.
> 
> That said, I’m sure somebody at Joyent would still like to see the crash dump.
> 
> Adam
> 
>> On 14 May 2017, at 10:20, Dr. Angelo Roussos <[email protected]> wrote:
>> 
>>  
>> Good morning,
>>  
>> One of our newer hosts rebooted earlier this morning - before this, the host 
>> had been up for about 5 months and functioning well.
>>  
>> Details of the hardware and running image are below, but it is currently 
>> hosting a relatively small number of guests for its hardware config - 80 
>> guests, being a mix of KVM/OS/LX guests.
>>  
>> The host server is installed with all-SSDs, in a RAIDZ2 configuration, 
>> spread equally across 2 vdevs.
>>  
>> We believe the main issue may be related to this - in SUMMARY:
>>  
>> ------------------------------
>>  
>> 2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot 
>> after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58 
>> occurred in module "genunix" due to a NULL pointer dereference
>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System dump 
>> time: Sun May 14 03:26:31 2017
>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving 
>> compressed system crash dump in /var/crash/volatile/vmdump.0
>>  
>> 2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress 
>> the crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'
>>  
>> ------------------------------
>>  
>>  
>> DETAILS ARE:
>>  
>> SYSINFO output:
>>  
>> [root@c10 ~]# sysinfo
>> {
>>   "Live Image": "20161013T025521Z",
>>   "System Type": "SunOS",
>>   "Boot Time": "1494737104",
>>   "SDC Version": "7.0",
>>   "Manufacturer": "Supermicro",
>>   "Product": "SYS-4048B-TR4FT",
>>   "Serial Number": "S186693X6715658",
>>   "SKU Number": "072615D9",
>>   "HW Version": "123456789",
>>   "HW Family": "SMC X10",
>>   "Setup": "false",
>>   "VM Capable": true,
>>   "CPU Type": "Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz",
>>   "CPU Virtualization": "vmx",
>>   "CPU Physical Cores": 4,
>>   "UUID": "00000000-0000-0000-0000-0cc47aa33714",
>>   "Hostname": "c10",
>>   "CPU Total Cores": 128,
>>   "MiB of Memory": "1572742",
>>   "Zpool": "zones",
>>   "Zpool Disks": 
>> "c0t5002538C401EDE5Ed0,c0t5002538C401EDE64d0,c0t5002538C401EDE65d0,c0t5002538C401EDE69d0,c0t5002538C401EE05Bd0,c0t5002538C4027D5C6d0,c0t5002538C402BD9EBd0,c0t5002538C402BD9EEd0,c0t5002538C402BD9F5d0,c0t5002538C402BD9F9d0,c0t5002538C402BD9FCd0,c0t5002538C402BDA00d0,c0t5002538C402BDAC6d0,c0t5002538C40599D5Cd0,c0t5002538C40599EDEd0,c0t5002538C40599EE8d0",
>>   "Zpool Profile": "raidz2",
>>   "Zpool Creation": 1477475020,
>>   "Zpool Size in GiB": 10073,
>>   "Disks": {
>>     "c0t5002538C401EDE5Ed0": {"Size in GB": 960},
>>     "c0t5002538C401EDE64d0": {"Size in GB": 960},
>>     "c0t5002538C401EDE65d0": {"Size in GB": 960},
>>     "c0t5002538C401EDE69d0": {"Size in GB": 960},
>>     "c0t5002538C401EE05Bd0": {"Size in GB": 960},
>>     "c0t5002538C4027D5C6d0": {"Size in GB": 960},
>>     "c0t5002538C402BD9EBd0": {"Size in GB": 960},
>>     "c0t5002538C402BD9EEd0": {"Size in GB": 960},
>>     "c0t5002538C402BD9F5d0": {"Size in GB": 960},
>>     "c0t5002538C402BD9F9d0": {"Size in GB": 960},
>>     "c0t5002538C402BD9FCd0": {"Size in GB": 960},
>>     "c0t5002538C402BDA00d0": {"Size in GB": 960},
>>     "c0t5002538C402BDAC6d0": {"Size in GB": 960},
>>     "c0t5002538C40599D5Cd0": {"Size in GB": 960},
>>     "c0t5002538C40599EDEd0": {"Size in GB": 960},
>>     "c0t5002538C40599EE8d0": {"Size in GB": 960}
>>   },
>>   "Boot Parameters": {
>>     "smartos": "true",
>>     "root_shadow": "xxxxxxxxxxx"
>>   },
>>   "Network Interfaces": {
>>     "ixgbe0": {"MAC Address": "0c:c4:7a:a3:37:14", "ip4addr": "xxxxxxxxxxx", 
>> "Link Status": "up", "NIC Names": ["admin"]},
>>     "ixgbe1": {"MAC Address": "0c:c4:7a:a3:37:15", "ip4addr": "", "Link 
>> Status": "up", "NIC Names": []}
>>   },
>>   "Virtual Network Interfaces": {
>>   },
>>   "Link Aggregations": {
>>  
>>  
>>  
>> On initial login on c10, we see:
>>  
>> 05:27:43 UTC 2017#012PLATFORM: SYS-4048B-TR4FT, CSN: S186693X6715658, 
>> HOSTNAME: c10#012SOURCE: fmd-self-diagnosis, REV: 1.0#012EVENT-ID: 
>> a916f701-ad5f-4c05-c134-8968a449677d#012DESC: A Solaris Fault Manager 
>> component has experienced an error that required the module to be disabled.  
>> Refer to http://illumos.org/msg/FMD-8000-2K for more 
>> information.#012AUTO-RESPONSE: The module has been disabled.  Events 
>> destined for the module will be saved for manual diagnosis.#012IMPACT: 
>> Automated diagnosis and response for subsequent events associated with this 
>> module will not occur.#012REC-ACTION: Use fmdump -v -u <EVENT-ID> to locate 
>> the module.  Use fmadm reset <module> to reset the module.
>>  
>>  
>> [root@c10 /var/log]# svcs -xv svc:/site/default_init:default
>> svc:/site/default_init:default (Copy init to /etc/default/init)
>> State: maintenance since May 14, 2017 04:45:56 AM UTC
>> Reason: Start method failed repeatedly, last exited with status 2.
>>    See: http://illumos.org/msg/SMF-8000-KS
>>    See: /var/svc/log/site-default_init:default.log
>> Impact: This service is not running.
>>  
>>  
>> [root@c10 /var/log]# fmadm config
>> MODULE                   VERSION STATUS  DESCRIPTION
>> cpumem-retire            1.1     active  CPU/Memory Retire Agent
>> disk-lights              1.0     active  Disk Lights Agent
>> disk-transport           1.0     active  Disk Transport Agent
>> eft                      1.16    active  eft diagnosis engine
>> ext-event-transport      0.2     active  External FM event transport
>> fabric-xlate             1.0     active  Fabric Ereport Translater
>> fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
>> io-retire                2.0     active  I/O Retire Agent
>> sensor-transport         1.1     active  Sensor Transport Agent
>> ses-log-transport        1.0     active  SES Log Transport Agent
>> software-diagnosis       0.1     failed  Software Diagnosis engine
>> software-response        0.1     active  Software Response Agent
>> sysevent-transport       1.0     active  SysEvent Transport Agent
>> syslog-msgs              1.1     active  Syslog Messaging Agent
>> zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
>> zfs-retire               1.0     active  ZFS Retire Agent
>>  
>> So:
>>  
>> software-diagnosis       0.1     failed  Software Diagnosis engine
>>  
>>  
>> [root@c10 /var/log]# fmadm faulty
>> --------------- ------------------------------------  -------------- 
>> ---------
>> TIME            EVENT-ID                              MSG-ID         SEVERITY
>> --------------- ------------------------------------  -------------- 
>> ---------
>> May 14 05:27:43 a916f701-ad5f-4c05-c134-8968a449677d  FMD-8000-2K    Minor
>>  
>> Host        : c10
>> Platform    : SYS-4048B-TR4FT   Chassis_id  : S186693X6715658
>> Product_sn  :
>>  
>> Fault class : defect.sunos.fmd.module
>> Affects     : fmd:///module/software-diagnosis
>>                   faulted and taken out of service
>> FRU         : None
>>                   faulty
>>  
>> Description : A Solaris Fault Manager component has experienced an error that
>>               required the module to be disabled.  Refer to
>>               http://illumos.org/msg/FMD-8000-2K for more information.
>>  
>> Response    : The module has been disabled.  Events destined for the module
>>               will be saved for manual diagnosis.
>>  
>> Impact      : Automated diagnosis and response for subsequent events 
>> associated
>>               with this module will not occur.
>>  
>> Action      : Use fmdump -v -u <EVENT-ID> to locate the module.  Use fmadm
>>               reset <module> to reset the module.
>>  
>>  
>> So it looks like the fmadm message is NOT related to the primary cause of 
>> the issue: the event was recorded AFTER the reboot.
>>  
>>  
>> Looking at /var/log/auth.log:
>>  
>> 2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] reboot 
>> after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffd00b88b0b3d0 addr=58 
>> occurred in module "genunix" due to a NULL pointer dereference
>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 850461 auth.warning] System dump 
>> time: Sun May 14 03:26:31 2017
>> 2017-05-14T04:45:45+00:00 c10 savecore: [ID 676874 auth.error] Saving 
>> compressed system crash dump in /var/crash/volatile/vmdump.0
>>  
>> 2017-05-14T05:27:41+00:00 c10 savecore: [ID 320429 auth.error] Decompress 
>> the crash dump with #012'savecore -vf /var/crash/volatile/vmdump.0'
>>  
>>  
>> Nothing in dmesg, EXCEPT zsched messages, as shown below:
>>  
>> .....
>> 2017-05-14T07:11:31.792661+00:00 c10 genunix: [ID 470503 kern.warning] 
>> WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
>> 2017-05-14T07:11:31.792671+00:00 c10 genunix: [ID 470503 kern.warning] 
>> WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)
>> 2017-05-14T07:11:31.873184+00:00 c10 genunix: [ID 470503 kern.warning] 
>> WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
>> 2017-05-14T07:11:31.873214+00:00 c10 genunix: [ID 470503 kern.warning] 
>> WARNING: Sorry, no swap space to grow stack for pid 41137 (zsched)
>> 2017-05-14T07:11:31.894092+00:00 c10 genunix: [ID 470503 kern.warning] 
>> WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)
>> 2017-05-14T07:11:31.894123+00:00 c10 genunix: [ID 470503 kern.warning] 
>> WARNING: Sorry, no swap space to grow stack for pid 41139 (zsched)
>> 2017-05-14T07:11:31.894133+00:00 c10 genunix: [ID 470503 kern.warning] 
>> WARNING: Sorry, no swap space to grow stack for pid 41140 (zsched)
>> 2017-05-14T07:11:39.760880+00:00 c10 mac: [ID 736570 kern.info] NOTICE: 
>> vnic1272 unregistered
>>  
>>  
>> BUT zsched messages AND PANIC message in auth.log seem to both be related to 
>> lwp issues that have been fixed:
>>  
>> https://github.com/joyent/illumos-joyent/issues/127
>> https://smartos.org/bugview/OS-4188
>> https://smartos.org/bugview/OS-5598
>>  
>>  
>> We have a core dump available - all 68GB (compressed) of it.
>>  
>> But it seems to us that the issue is related to the above two 'fixed' (or 
>> not?) issues. Or might this be new?
>>  
>> Are we going down the incorrect path?
>>  
>> Any feedback would be greatly appreciated,
>>  
>> Thanks,
>>  
>> Angelo.
>>  
>>  
>>  
> 
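
In case it helps anyone triaging similar logs, below is a minimal sketch in 
Python for pulling the key fields out of a savecore panic entry and for 
counting the zsched "no swap space" warnings per pid. It is only an 
illustration: the function names and regular expressions are my own 
assumptions, written against the exact message wording quoted above, and they 
expect each log entry on a single line (the entries are wrapped in this 
email).

```python
import re
from collections import Counter

# Matches the "BAD TRAP" panic text as it appears in the savecore entry above.
PANIC_RE = re.compile(
    r'BAD TRAP: type=(?P<type>\w+) \((?P<desc>[^)]*)\) '
    r'rp=(?P<rp>[0-9a-f]+) addr=(?P<addr>[0-9a-f]+) '
    r'occurred in module "(?P<module>[^"]+)"'
)

# Matches the zsched stack-growth warning as it appears in the dmesg excerpt.
ZSCHED_RE = re.compile(
    r'no swap space to grow stack for pid (?P<pid>\d+) \(zsched\)'
)


def parse_panic(line):
    """Return a dict of trap fields from one log line, or None if no match."""
    match = PANIC_RE.search(line)
    return match.groupdict() if match else None


def count_zsched_warnings(lines):
    """Count zsched 'no swap space' warnings per pid across the given lines."""
    counts = Counter()
    for line in lines:
        match = ZSCHED_RE.search(line)
        if match:
            counts[int(match.group("pid"))] += 1
    return counts


if __name__ == "__main__":
    # Panic entry from auth.log, rejoined onto one line.
    panic_line = (
        '2017-05-14T04:45:56.790224+00:00 c10 savecore: [ID 570001 auth.error] '
        'reboot after panic: BAD TRAP: type=e (#pf Page fault) '
        'rp=ffffd00b88b0b3d0 addr=58 occurred in module "genunix" '
        'due to a NULL pointer dereference'
    )
    print(parse_panic(panic_line))

    # Two of the zsched warnings from the excerpt, rejoined onto one line each.
    warning_lines = [
        '2017-05-14T07:11:31.792661+00:00 c10 genunix: [ID 470503 kern.warning] '
        'WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)',
        '2017-05-14T07:11:31.792671+00:00 c10 genunix: [ID 470503 kern.warning] '
        'WARNING: Sorry, no swap space to grow stack for pid 41133 (zsched)',
    ]
    print(count_zsched_warnings(warning_lines))
```

Run against a full log, the per-pid counts make it easier to see whether the 
warnings cluster around a handful of zones or are spread across many.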



-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=25769125&id_secret=25769125-7688e9fb
Powered by Listbox: http://www.listbox.com
