I'm not sure what sector size it uses. The fact that a random subset of devices fails on each reboot suggests that the NVMe driver is not doing the right thing. I am not going to blame the hardware, because everything looks good under Solaris 11.3 and CentOS.

Thanks!
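For the sector-format question quoted below: one way to confirm the formatted block size on a running system is the DTrace sketch below. It assumes the illumos nvme driver's nvme_bd_mediainfo() blkdev callback and the bd_media_t fields from blkdev.h, so treat the names as unverified; a printed blksize of 4096 would mean the namespace is 4K-formatted, 512 the more common default.

#!/usr/sbin/dtrace -Zs
/*
 * Sketch only: print the logical block size each NVMe namespace reports
 * to blkdev. -Z lets the script start before the nvme module is loaded;
 * the probe fires when blkdev queries the media (e.g. on attach/open).
 */
fbt::nvme_bd_mediainfo:entry
{
        /* args[1] is the bd_media_t the driver is about to fill in. */
        self->media = args[1];
}

fbt::nvme_bd_mediainfo:return
/self->media/
{
        printf("blksize = %u, nblks = %u\n",
            self->media->m_blksize, self->media->m_nblks);
        self->media = 0;
}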
On Mon, Aug 1, 2016 at 2:39 PM, Michael Loftis <[email protected]> wrote:

> One random question, are the affected SSDs LBA#3/4K sector formatted?
>
> On Mon, Aug 1, 2016 at 10:41 AM, Youzhong Yang <[email protected]> wrote:
>
>> Hello again,
>>
>> Thanks Robert for the advice. I've spent some time struggling with why NVMe SSDs were retired even though no error was reported by the NVMe driver; it turns out to be a victim of fmd_asru_hash_replay_asru(), i.e. if we don't tell fmd a fault is repaired, the next time the host is rebooted it tries to replay the event.
>>
>> I plugged in all 24 NVMe SSDs, and the driver reported errors like these (see the attached txt file for additional info):
>>
>> 2016-07-30T23:11:53.468013-04:00 batfs9995 nvme: [ID 265585 kern.warning] WARNING: nvme3: command timeout, OPC = 6, CFS = 0
>> 2016-07-30T23:11:53.468018-04:00 batfs9995 nvme: [ID 265585 kern.warning] WARNING: nvme3: command timeout, OPC = 8, CFS = 0
>> 2016-07-30T23:11:53.468024-04:00 batfs9995 nvme: [ID 176450 kern.warning] WARNING: nvme3: nvme_admin_cmd failed for ABORT
>> 2016-07-30T23:11:53.468032-04:00 batfs9995 nvme: [ID 366983 kern.warning] WARNING: nvme3: nvme_admin_cmd failed for IDENTIFY
>> 2016-07-30T23:11:53.468038-04:00 batfs9995 nvme: [ID 318795 kern.warning] WARNING: nvme3: failed to identify controller
>> 2016-07-30T23:11:53.468045-04:00 batfs9995 genunix: [ID 408114 kern.info] /pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0 (nvme3) down
>>
>> Here is my understanding of what happened after the NVMe driver reported the above errors:
>>
>> - The NVMe driver called ddi_fm_service_impact(nvme->n_dip, DDI_SERVICE_LOST) to report the error for device /pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0
>>
>> - fmd received an ereport.io.service.lost event with device-path = /pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0
>>
>> - fmd decided the event affects the following devs:
>>   dev:////pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0
>>   dev:////pci@6d,0/pci8086,6f04@2/pci10b5,9765@0
>>   dev:////pci@6d,0/pci8086,6f04@2
>>
>> - fmd sent requests to retire the above devs, which caused all the SSDs under /pci@6d,0/pci8086,6f04@2 to be retired.
>>
>> Why fmd decides to retire the ancestors of the problematic device is a separate issue; the issue here is why the NVMe driver failed to execute some of its commands during nvme_attach(). Every time I rebooted the host it randomly failed some of the 24 devices, and occasionally there was no error at all.
>>
>> Just an update on what I am up to; hopefully you guys can shed some light on what can be done next.
>>
>> Thanks,
>>
>> -Youzhong
>>
>> On Fri, Jun 24, 2016 at 8:13 PM, Robert Mustacchi <[email protected]> wrote:
>>
>>> On 6/24/16 11:05, Youzhong Yang wrote:
>>>
>>> > I panicked the host when e_ddi_retire_device() is called; here is what I found:
>>> >
>>> > it is /usr/lib/fm/fmd/fmd that calls modctl -> modctl_retire -> e_ddi_retire_device to retire /pci@0,0/pci8086,6f08@3.
>>>
>>> Okay, this makes some amount of sense; we're seeing various FM ereports being generated at a rate which causes us to eventually offline the device.
>>>
>>> > Attached is a file with some entries produced by fmdump. It's weird that sometimes I got those fm entries but sometimes the system generated nothing but still retired the drives.
>>> >
>>> > I don't know how to interpret those entries; maybe someone on the list can shed some light?
>>>
>>> So, these are errors that are based on the PCI Express specification, and the various entries usually refer to parts of the advanced error reporting (AER) capabilities. What I do here is go through and look at the correctable and uncorrectable error status members, which correspond to the registers.
>>>
>>> So the first one, starting at line 11, indicates that a receiver error was encountered. Note that the entry that generated it is not the device, but what seems like the non-transparent bridge.
>>>
>>> It's also worth calling out what the general ereports are talking about. You'll note there are basically three different classes there:
>>>
>>> - ereport.io.pci.fabric
>>> - ereport.io.pciex.rc.ce-msg
>>> - ereport.io.pciex.pl.re
>>>
>>> So, the pl.re ones indicate receiver errors, which, if I'm reading this correctly, suggests issues in some of the decoding of data?
>>>
>>> The rc.ce-msg ones mean that the root complex has been informed of correctable errors.
>>>
>>> That said, some of the messages that have arrived at the root port seem a bit odd.
>>>
>>> > Device 8086:6f08 is "Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3" and seems to use the "PCIe bridge/switch driver" (pcieb). Is it possible the pcieb driver in illumos does not work properly with this device?
>>>
>>> It looks like the actual NVMe devices may be connected to a non-transparent bridge. So it's highly likely that that device is failing, which is also what's directly connected to that port. I have seen something similar, but not on a system we have at Joyent.
>>>
>>> I'm going to have to spend a bit more time understanding the exact set of FM actions that have caused us to end up deciding to offline that, but in the interim, I'd suggest that we go through and see if this is correlated at all with activity to the NVMe devices. While I'm not sure that I have any reason to believe that the NVMe driver is at issue, it might be a useful data point.
>>>
>>> First, what I'd suggest is that you use dtrace -Z here. -Z basically tells DTrace to ignore probes that don't exist. That way, when you run add_drv on nvme, if it sees that the functions are in the nvme driver, it'll end up enabling them. Then make sure you kill DTrace before you want to rem_drv, otherwise it'll block it.
>>>
>>> Perhaps let's try something like:
>>>
>>> dtrace -Zn 'fbt::pf_send_ereport:entry,fbt::nvme_submit_cmd:entry{ trace(timestamp); }' -n 'fbt::nvme_wait_cmd:return{ trace(timestamp); trace(arg1); }'
>>>
>>> Robert
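Building on Robert's suggestion, the same probes can be dropped into a standalone script, with a kernel stack on each pf_send_ereport() so the pl.re and rc.ce-msg ereports can be tied back to whatever raised them. This is only a sketch and assumes the function names from the one-liner above are unchanged:

#!/usr/sbin/dtrace -Zs
/*
 * Expanded form of the one-liner above: timestamp every PCIe ereport
 * dispatch and every NVMe command submission/completion, and capture a
 * kernel stack for each ereport so it can be matched against the device
 * paths seen in fmdump.
 */
fbt::pf_send_ereport:entry
{
        printf("%d pf_send_ereport\n", timestamp);
        stack();
}

fbt::nvme_submit_cmd:entry
{
        printf("%d nvme_submit_cmd\n", timestamp);
}

fbt::nvme_wait_cmd:return
{
        /* arg1 is the raw return value, as in the one-liner above. */
        printf("%d nvme_wait_cmd -> %d\n", timestamp, arg1);
}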

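Separately, since the retire chain above starts with the driver calling ddi_fm_service_impact(..., DDI_SERVICE_LOST), it may help to watch the admin commands and the service-impact call together while re-attaching the driver (e.g. around the add_drv nvme step Robert describes). A rough sketch, assuming nvme_admin_cmd() returns a status as the console messages imply:

#!/usr/sbin/dtrace -Zs
/*
 * Sketch only: time each nvme admin command and catch the moment any
 * driver reports a service impact to FMA. arg1 of the
 * ddi_fm_service_impact() entry probe is the impact code
 * (e.g. DDI_SERVICE_LOST).
 */
fbt::nvme_admin_cmd:entry
{
        self->ts = timestamp;
}

fbt::nvme_admin_cmd:return
/self->ts/
{
        printf("nvme_admin_cmd -> %d after %d us\n",
            arg1, (timestamp - self->ts) / 1000);
        self->ts = 0;
}

fbt::ddi_fm_service_impact:entry
{
        printf("ddi_fm_service_impact(impact = 0x%x) from:\n", arg1);
        stack();
}

If the slow or failed admin commands cluster under one switch port, that would point back at the bridge rather than at the SSDs themselves.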