Thanks for the input, Robert. I believe the issue is now resolved by using the MSI-X interrupt type (instead of FIXED) inside nvme_init() for the admin queue.
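Roughly, the change amounts to preferring MSI-X when allocating the admin queue interrupt and only falling back to FIXED when MSI-X is unavailable. Below is a minimal sketch of that standard DDI interrupt pattern; the helper name and error handling are mine, not the actual nvme_init() diff:

#include <sys/ddi.h>
#include <sys/sunddi.h>

/*
 * Sketch only: allocate a single interrupt for the admin queue,
 * preferring MSI-X over MSI over FIXED.  The real driver keeps the
 * handle in its soft state and registers its own handler afterwards.
 */
static int
nvme_alloc_admin_intr(dev_info_t *dip, ddi_intr_handle_t *ihp)
{
        int types, type, actual;

        if (ddi_intr_get_supported_types(dip, &types) != DDI_SUCCESS)
                return (DDI_FAILURE);

        if (types & DDI_INTR_TYPE_MSIX)
                type = DDI_INTR_TYPE_MSIX;
        else if (types & DDI_INTR_TYPE_MSI)
                type = DDI_INTR_TYPE_MSI;
        else
                type = DDI_INTR_TYPE_FIXED;

        if (ddi_intr_alloc(dip, ihp, type, 0, 1, &actual,
            DDI_INTR_ALLOC_NORMAL) != DDI_SUCCESS)
                return (DDI_FAILURE);

        if (actual != 1) {
                (void) ddi_intr_free(ihp[0]);
                return (DDI_FAILURE);
        }

        return (DDI_SUCCESS);
}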
Here is the issue report I just filed: https://www.illumos.org/issues/7273

I don't know why the FIXED interrupt type would cause problems; perhaps it is because we have so many NVMe SSDs?

By the way, if you happen to have something in mind regarding where to start on the fmd issue (when one device fails we end up failing the rest), please let me know. That issue cost me days of debugging time, and I would really like to see it resolved.

Thanks,

-Youzhong

On Tue, Aug 2, 2016 at 1:07 AM, Robert Mustacchi <[email protected]> wrote:

> On 8/1/16 10:41, Youzhong Yang wrote:
> > Hello again,
> >
> > Thanks, Robert, for the advice. I've spent some time struggling with why the NVMe SSDs were retired even though no error was reported by the NVMe driver. It turns out to be a victim of fmd_asru_hash_replay_asru(): if we don't tell fmd a fault has been repaired, then the next time the host is rebooted it tries to replay the event.
> >
> > I plugged in all 24 NVMe SSDs, and the driver reported errors like these (see the attached txt file for additional info):
> >
> > 2016-07-30T23:11:53.468013-04:00 batfs9995 nvme: [ID 265585 kern.warning] WARNING: nvme3: command timeout, OPC = 6, CFS = 0
> > 2016-07-30T23:11:53.468018-04:00 batfs9995 nvme: [ID 265585 kern.warning] WARNING: nvme3: command timeout, OPC = 8, CFS = 0
> > 2016-07-30T23:11:53.468024-04:00 batfs9995 nvme: [ID 176450 kern.warning] WARNING: nvme3: nvme_admin_cmd failed for ABORT
> > 2016-07-30T23:11:53.468032-04:00 batfs9995 nvme: [ID 366983 kern.warning] WARNING: nvme3: nvme_admin_cmd failed for IDENTIFY
> > 2016-07-30T23:11:53.468038-04:00 batfs9995 nvme: [ID 318795 kern.warning] WARNING: nvme3: failed to identify controller
> > 2016-07-30T23:11:53.468045-04:00 batfs9995 genunix: [ID 408114 kern.info] /pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0 (nvme3) down
> >
> > Here is my understanding of what happened after the NVMe driver reported the above errors:
> >
> > - The NVMe driver called ddi_fm_service_impact(nvme->n_dip, DDI_SERVICE_LOST) to report the error for device /pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0
> >
> > - fmd received an ereport.io.service.lost event with device-path = /pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0
> >
> > - fmd decided the event affects the following devs:
> >   dev:////pci@6d,0/pci8086,6f04@2/pci10b5,9765@0/pci10b5,9765@7/pci8086,370a@0
> >   dev:////pci@6d,0/pci8086,6f04@2/pci10b5,9765@0
> >   dev:////pci@6d,0/pci8086,6f04@2
> >
> > - fmd sent requests to retire the above devs, which caused all the SSDs under /pci@6d,0/pci8086,6f04@2 to be retired.
> >
> > Why fmd decides to retire the ancestors of the problematic device is a separate issue; the question here is why the NVMe driver failed to execute some of its commands during nvme_attach(). Every time I rebooted the host it failed a random subset of the 24 devices, and occasionally there was no error at all.
> >
> > This is just an update on what I am up to; hopefully you guys can shed some light on what can be done next.
>
> Thanks for the detailed report. This is quite helpful. And yes, the fact that when one fails we end up failing the rest doesn't make sense and is something we should look into.
>
> It sounds like you found one potential issue where we've gotten the wrong size of a structure.
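A side note on the error path quoted above: the service-impact notification itself is a single call in the driver, and everything after that, including retiring the ancestor bridges, happens in fmd. A minimal sketch of what such a timeout path looks like follows; the helper name is mine, not the actual nvme code:

#include <sys/ddi.h>
#include <sys/sunddi.h>
#include <sys/cmn_err.h>
#include <sys/ddifm.h>

/*
 * Sketch only: once an admin command has timed out, log it and tell
 * the FMA framework that service on this device has been lost.  fmd
 * then decides what, if anything, gets retired.
 */
static void
nvme_report_service_lost(dev_info_t *dip)
{
        dev_err(dip, CE_WARN, "!admin command timeout, service lost");
        ddi_fm_service_impact(dip, DDI_SERVICE_LOST);
}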
> So my next question would be: when we have these commands time out, are they always the same commands, and do they always occur when we're trying to attach and identify the device? If it's always the same command, maybe we can use a little bit of DTrace to dump what commands these are and what contents we're putting in them to make sure they make sense.
>
> Please keep us posted and let me know if we can help further with this.
>
> Robert
>
> > On Fri, Jun 24, 2016 at 8:13 PM, Robert Mustacchi <[email protected]> wrote:
> >
> >> On 6/24/16 11:05, Youzhong Yang wrote:
> >>
> >>> I panicked the host when e_ddi_retire_device() is called; here is what I found:
> >>>
> >>> it is /usr/lib/fm/fmd/fmd that calls modctl -> modctl_retire -> e_ddi_retire_device to retire /pci@0,0/pci8086,6f08@3.
> >>
> >> Okay, this makes some amount of sense: we're seeing various FM ereports being generated at a rate which causes us to eventually offline the device.
> >>
> >>> Attached is a file with some entries produced by fmdump. It's weird that sometimes I got those fm entries, but sometimes the system generated nothing and still retired the drives.
> >>>
> >>> I don't know how to interpret those entries; maybe someone on the list can shed some light?
> >>
> >> So, these are errors that are based on the PCI Express specification, and the various entries usually refer to parts of the advanced error reporting capabilities. What I do here is go through and look at the correctable and uncorrectable error status members, which correspond to the registers.
> >>
> >> So the first one, starting at line 11, indicates that a receiver error was encountered. Note that the entry that generated it is not the device, but what seems like the non-transparent bridge.
> >>
> >> It's also worth calling out what the general ereports are talking about. You'll note there are basically three different classes there:
> >>
> >> - ereport.io.pci.fabric
> >> - ereport.io.pciex.rc.ce-msg
> >> - ereport.io.pciex.pl.re
> >>
> >> So, the pl.re entries indicate receiver errors, which, if I'm reading this correctly, points to issues in some of the decoding of data.
> >>
> >> The rc.ce-msg class means that the root complex has been informed of correctable errors.
> >>
> >> That said, some of the messages that have arrived at the root port seem a bit odd.
> >>
> >>> Device 8086:6f08 is "Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3" and seems to use the "PCIe bridge/switch driver" (pcieb). Is it possible the pcieb driver in illumos does not work properly with this device?
> >>
> >> It looks like the actual NVMe devices may be connected to a non-transparent bridge. So it's highly likely that that device is failing, which is also what's directly connected to that port. I have seen something similar, but not on a system we have at Joyent.
> >>
> >> I'm going to have to spend a bit more time understanding the exact set of FM actions that have caused us to decide to offline that device, but in the interim I'd suggest that we go through and see if this is correlated at all with activity to the NVMe devices. While I'm not sure I have any reason to believe that the NVMe driver is at issue, it might be a useful data point.
> >>
> >> First, what I'd suggest is that you use dtrace -Z here. -Z basically tells DTrace to ignore probes that don't exist.
That way when you run > >> add_drv on nvme, if it sees that the functions are in the nvme driver, > >> it'll end up enabling them. Then, make sure you kill DTrace before you > >> want to rem_drv, otherwise it'll block it. > >> > >> Perhaps let's try something like: > >> > >> dtrace -Zn 'fbt::pf_send_ereport:entry,fbt::nvme_submit_cmd:entry{ > >> trace(timestamp); }' -n 'fbt::nvme_wait_cmd:return{ trace(timestamp); > >> trace(arg1); }' > >> > >> Robert > >> > > > > > > ------------------------------------------- smartos-discuss Archives: https://www.listbox.com/member/archive/184463/=now RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00 Modify Your Subscription: https://www.listbox.com/member/?member_id=25769125&id_secret=25769125-7688e9fb Powered by Listbox: http://www.listbox.com
