Hello again,

Thanks Robert for the advice. I've spent some time struggling with why
NVMe SSDs were retired even though the NVMe driver reported no error; it
turns out to be a victim of fmd_asru_hash_replay_asru(), i.e. if we don't
tell fmd a fault has been repaired, fmd replays the fault event the next
time the host is rebooted.
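
In case anyone else trips over the same replay behavior: once the stale
case shows up in fmadm faulty, telling fmd it has been repaired should
stop fmd_asru_hash_replay_asru() from resurrecting it on the next boot.
A rough sketch, using the case UUID from the fmadm faulty output further
down (I believe fmadm repair also accepts the resource FMRI):

fmadm faulty                                       # note the EVENT-ID (UUID)
fmadm repair 0a9658e1-7ff4-eb83-aee2-874f8cd04d63  # mark the case repaired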

With all 24 NVMe SSDs plugged in, the driver reported errors like these
(see the attached txt file for additional info):

2016-07-30T23:11:53.468013-04:00 batfs9995 nvme: [ID 265585 kern.warning]
WARNING: nvme3: command timeout, OPC = 6, CFS = 0
2016-07-30T23:11:53.468018-04:00 batfs9995 nvme: [ID 265585 kern.warning]
WARNING: nvme3: command timeout, OPC = 8, CFS = 0
2016-07-30T23:11:53.468024-04:00 batfs9995 nvme: [ID 176450 kern.warning]
WARNING: nvme3: nvme_admin_cmd failed for ABORT
2016-07-30T23:11:53.468032-04:00 batfs9995 nvme: [ID 366983 kern.warning]
WARNING: nvme3: nvme_admin_cmd failed for IDENTIFY
2016-07-30T23:11:53.468038-04:00 batfs9995 nvme: [ID 318795 kern.warning]
WARNING: nvme3: failed to identify controller
2016-07-30T23:11:53.468045-04:00 batfs9995 genunix: [ID 408114 kern.info]
/pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0
(nvme3) down

Here is my understanding of what happened after the NVMe driver reported
the above errors (a DTrace sketch to confirm this chain follows the list):

- The NVMe driver called ddi_fm_service_impact(nvme->n_dip, DDI_SERVICE_LOST)
  to report the error for device
  /pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0

- fmd received an ereport.io.service.lost event with device-path =
  /pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0

- fmd decided the event affects the following devs:
       dev:////pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0
       dev:////pci@6d,0/pci8086,6f04@2/pci10b5,9765@0
       dev:////pci@6d,0/pci8086,6f04@2

- fmd sent requests to retire the above devs, which caused all the SSDs
under /pci@6d,0/pci8086,6f04@2 to be retired.
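
To watch this chain end to end, something along the lines of Robert's
earlier one-liner should do; both function names appear in this thread,
and the probes only record the impact code and kernel stacks, so no
argument dereferencing is assumed:

dtrace -n 'fbt::ddi_fm_service_impact:entry{ trace(arg1); stack(); }' \
       -n 'fbt::e_ddi_retire_device:entry{ stack(); }'

arg1 on the first probe is the impact code (DDI_SERVICE_LOST here), and
the stack on the second probe should show whether the retire really comes
in via modctl_retire from fmd, as observed before.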

Why fmd decides to retire the ancestors of the problematic device is a
separate issue; the question here is why the NVMe driver fails to execute
some of its admin commands during nvme_attach(). Every time I reboot the
host, a random subset of the 24 devices fails, and only rarely do all of
them come up with no errors at all.
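
To narrow down which admin commands are timing out and how long they sit
in the driver, a small extension of Robert's DTrace suggestion might help.
This is only a sketch: it assumes nvme_admin_cmd returns an int status
(arg1 at the return probe) and dereferences no arguments. Since the
failures happen during boot-time attach, the probes only fire if the
driver is re-attached via rem_drv/add_drv as Robert described, or if the
script is set up as an anonymous enabling with dtrace -A before a reboot:

dtrace -Zqn 'fbt::nvme_admin_cmd:entry{ self->ts = timestamp; }' \
       -n 'fbt::nvme_admin_cmd:return/self->ts/{
               printf("%s: %d us, rv = %d\n", probefunc,
                   (timestamp - self->ts) / 1000, arg1);
               self->ts = 0; }'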

Just an update on where I am; hopefully you guys can shed some light on
what can be done next.

Thanks,

-Youzhong


On Fri, Jun 24, 2016 at 8:13 PM, Robert Mustacchi <r...@joyent.com> wrote:

> On 6/24/16 11:05 , Youzhong Yang wrote:
>
> > I panicked the host when e_ddi_retire_device() is called, here is what I
> > found:
> >
> > it is /usr/lib/fm/fmd/fmd who calls modctl -> modctl_retire
> > -> e_ddi_retire_device to retire /pci@0,0/pci8086,6f08@3.
>
> Okay, this makes some amount of sense, we're seeing various FM ereports
> being generated at a rate which causes us to eventually offline the device.
>
> > Attached is a file with some entries produced by fmdump. It's weird that
> > sometimes I got those fm entries but sometimes the system generated
> nothing
> > but still retired the drives.
> >
> > I don't know how to interpret those entries, maybe someone on the list
> can
> > shed some light?
>
> So, these are errors that are based on the PCI express specification and
> the various entries usually refer to parts of the advanced error
> reporting capabilities. So, what I do here is I go through and look at
> the correctable and uncorrectable error status members which correspond
> to the registers.
>
> So the first one starting at line 11 indicates that a receive error was
> encountered. Note that the entry that generated it is not the device,
> but what seems like the non-transparent bridge.
>
> It's also worth calling out what the general ereports are talking about.
> You'll note there are basically three different classes there:
>
> - ereport.io.pci.fabric
> - ereport.io.pciex.rc.ce-msg
> - ereport.io.pciex.pl.re
>
> So, the pl.re entries indicate receiver errors, which, if I'm reading
> this correctly, indicates issues in some of the decoding of data?
>
> The rc.ce-msg means that the root complex has been informed of
> correctable errors.
>
> That said, some of the messages that have arrived at the root port seem a
> bit odd.
>
> > Device 8086:6f08 is "Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3
> > v4/Xeon D PCI Express Root Port 3" and seems to use "PCIe bridge/switch
> > driver" (pcieb). Is it possible the pcieb driver in illumos does not work
> > properly with this device?
> 
> It looks like the actual NVMe devices may be connected to a
> non-transparent bridge. So it's highly likely that that device is
> failing which is also what's directly connected to that port. I have
> seen something similar, but not on a system we have at Joyent.
> 
> I'm going to have to spend a bit more time understanding the exact set
> of FM actions that have caused us to end up deciding to offline that,
> but in the interim, I'd suggest that we go through and see if this is
> correlated at all with activity to the NVMe devices. While I'm not sure
> that I have any reason to believe that the NVMe driver is at issue, it
> might be a useful data point.
> 
> First, what I'd suggest is that you use dtrace -Z here. -Z basically
> tells DTrace to ignore probes that don't exist. That way when you run
> add_drv on nvme, if it sees that the functions are in the nvme driver,
> it'll end up enabling them. Then, make sure you kill DTrace before you
> want to rem_drv, otherwise it'll block it.
> 
> Perhaps let's try something like:
> 
> dtrace -Zn 'fbt::pf_send_ereport:entry,fbt::nvme_submit_cmd:entry{
> trace(timestamp); }' -n 'fbt::nvme_wait_cmd:return{ trace(timestamp);
> trace(arg1); }'
> 
> Robert
> 

[root@batfs9995 ~]# fmdump -eV
Jul 30 2016 23:11:36.845004164 ereport.io.service.lost
nvlist version: 0
        class = ereport.io.service.lost
        ena = 0xbbc6e36fb1a00401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = 
/pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0
        (end detector)

        __ttl = 0x1
        __tod = 0x579d6c68 0x325dbd84
                
[root@batfs9995 ~]# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jul 30 23:12:06 0a9658e1-7ff4-eb83-aee2-874f8cd04d63  PCIEX-8000-0A  Critical

Host        : batfs9995
Platform    : SYS-2028U-TN24R4T+        Chassis_id  : S213283X6507518
Product_sn  :

Fault class : fault.io.pciex.device-interr
Affects     : 
dev:////pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0
              dev:////pci@6d,0/pci8086,6f04@2/pci10b5,9765@0
              dev:////pci@6d,0/pci8086,6f04@2
                  faulted and taken out of service
FRU         : "MB" 
(hc://:product-id=SYS-2028U-TN24R4T+:server-id=batfs9995:chassis-id=S213283X6507518/motherboard=0)
                  faulty

Description : A problem was detected for a PCIEX device.
              Refer to http://illumos.org/msg/PCIEX-8000-0A for more
              information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : Schedule a repair procedure to replace the affected device.  Use
              fmadm faulty to identify the device or contact Sun for support.

2016-07-30T23:11:53.468013-04:00 batfs9995 nvme: [ID 265585 kern.warning] 
WARNING: nvme3: command timeout, OPC = 6, CFS = 0
2016-07-30T23:11:53.468018-04:00 batfs9995 nvme: [ID 265585 kern.warning] 
WARNING: nvme3: command timeout, OPC = 8, CFS = 0
2016-07-30T23:11:53.468024-04:00 batfs9995 nvme: [ID 176450 kern.warning] 
WARNING: nvme3: nvme_admin_cmd failed for ABORT
2016-07-30T23:11:53.468032-04:00 batfs9995 nvme: [ID 366983 kern.warning] 
WARNING: nvme3: nvme_admin_cmd failed for IDENTIFY
2016-07-30T23:11:53.468038-04:00 batfs9995 nvme: [ID 318795 kern.warning] 
WARNING: nvme3: failed to identify controller
2016-07-30T23:11:53.468045-04:00 batfs9995 genunix: [ID 408114 kern.info] 
/pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0 (nvme3) 
down

2016-07-30T23:12:06.759009-04:00 batfs9995 genunix: [ID 913975 kern.notice] 
NOTICE: Retire device: found dip = ffffd1270479cd48, path = 
/pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0
2016-07-30T23:12:06.807523-04:00 batfs9995 genunix: [ID 913975 kern.notice] 
NOTICE: Retire device: found dip = ffffd127047667f8, path = 
/pci@6d,0/pci8086,6f04@2/pci10b5,9765@0
2016-07-30T23:12:06.821228-04:00 batfs9995 genunix: [ID 913975 kern.notice] 
NOTICE: Retire device: found dip = ffffd12704770000, path = 
/pci@6d,0/pci8086,6f04@2
