On 6/24/16 11:05, Youzhong Yang wrote:
> I panicked the host when e_ddi_retire_device() is called, here is what I
> found:
>
> it is /usr/lib/fm/fmd/fmd who calls modctl -> modctl_retire
> -> e_ddi_retire_device to retire /pci@0,0/pci8086,6f08@3.
Okay, this makes some amount of sense: we're seeing various FM ereports
being generated at a rate that eventually causes us to offline the device.
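If you want to see exactly what fmd diagnosed before the retire was
issued, the standard fmd tooling should show it (nothing here is
specific to this box, just the usual commands):

    fmadm faulty    # currently faulted resources and their suspects
    fmdump -v       # the fault log, with the diagnosis that fired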
> Attached is a file with some entries produced by fmdump. It's weird that
> sometimes I got those fm entries but sometimes the system generated nothing
> but still retired the drives.
>
> I don't know how to interpret those entries, maybe someone on the list can
> shed some light?
So, these are errors based on the PCI Express specification, and the
various entries usually refer to parts of the Advanced Error Reporting
(AER) capability. What I do here is go through and look at the
correctable and uncorrectable error status members, which correspond to
the AER status registers.
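For reference, the correctable error status register in the AER
capability breaks down like this per the PCIe base spec (these are the
bits you're most likely to see set):

    bit 0   Receiver Error
    bit 6   Bad TLP
    bit 7   Bad DLLP
    bit 8   REPLAY_NUM Rollover
    bit 12  Replay Timer Timeout
    bit 13  Advisory Non-Fatal Error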
So the first one, starting at line 11, indicates that a receiver error
was encountered. Note that the entry that generated it is not the NVMe
device itself, but what seems like the non-transparent bridge.
It's also worth calling out what the general ereports are talking about.
You'll note there are basically three different classes there:
- ereport.io.pci.fabric
- ereport.io.pciex.rc.ce-msg
- ereport.io.pciex.pl.re
So, the pl.re entries indicate receiver errors at the physical layer,
which, if I'm reading this correctly, suggests problems in some of the
decoding of data on the link.
The rc.ce-msg means that the root complex has been informed of
correctable errors.
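If you want to pull out just one class with its full payload, including
those status members, fmdump can filter on class; something like:

    fmdump -eV -c ereport.io.pciex.pl.re

should dump only the receiver-error ereports in verbose form.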
That said, some of the messages that have arrived at the root port seem
a bit odd.
> Device 8086:6f08 is "Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3
> v4/Xeon D PCI Express Root Port 3" and seems to use "PCIe bridge/switch
> driver" (pcieb). Is it possible the pcieb driver in illumos does not work
> properly with this device?
It looks like the actual NVMe devices may be connected to a
non-transparent bridge. So it's highly likely that the failing device is
the bridge itself, which is also what's directly connected to that port.
I have seen something similar, but not on a system we have at Joyent.
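To confirm what's actually sitting between that root port and the NVMe
devices, dumping the device tree with driver bindings should do it,
e.g.:

    prtconf -Dv | less

and then look for the pci8086,6f08 node and whatever hangs off of it.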
I'm going to have to spend a bit more time understanding the exact set
of FM actions that led us to decide to offline that device, but in the
interim, I'd suggest we go through and see if this is correlated at all
with activity to the NVMe devices. While I'm not sure that I have any
reason to believe the NVMe driver is at issue, it might be a useful data
point.
First, what I'd suggest is that you use dtrace -Z here. -Z tells DTrace
to permit probe descriptions that match no probes rather than erroring
out. That way, when you run add_drv on nvme and the functions in the
nvme driver appear, DTrace will end up enabling the probes for them.
Then, make sure you kill DTrace before you want to rem_drv; otherwise
the enabled probes will hold the module and block the unload.
Perhaps let's try something like:
    dtrace -Zn 'fbt::pf_send_ereport:entry,fbt::nvme_submit_cmd:entry
        { trace(timestamp); }' \
        -n 'fbt::nvme_wait_cmd:return { trace(timestamp); trace(arg1); }'
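The idea being: start the one-liner first (-Z means it will sit there
happily before nvme is even loaded), then add_drv nvme and drive some
I/O. If the pf_send_ereport firings cluster in time with the
nvme_submit_cmd/nvme_wait_cmd activity, that suggests I/O to the NVMe
devices is what's triggering the link errors; if the ereports keep
showing up with no NVMe activity at all, that points more squarely at
the bridge failing on its own.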
Robert