Sure thing. I'm attaching all of the logs I have so that you (and anyone who runs into a similar issue) can get the bigger picture. Hopefully I didn't mess anything up.
Unfortunately, I've seen almost every single device fail at one point or another. I thought the problem might be isolated to a single PLX riser card, but I have now seen devices fail under every parent device at one time or another, so I don't think I can narrow it down to a single PCI slot or PLX riser as the culprit. Unless both of these boards are bad, I also conclude that this indicates a problem with the hardware.

I completely agree that if the PCI bus reset isn't working properly, nothing is going to work. I sent these steps to the manufacturer to see whether they can reproduce the issue on their end. If they can, they will need to investigate why the problem exists. If they can't, it is possible we have a bad set of boards in this machine.

Thank you so much for your help. Really appreciate it.

-Kevin

#: lspci -tv
| +-1f.0 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU | \-1f.2 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU \-[0000:00]-+-00.0 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2 +-01.0-[01]-- +-02.0-[02-08]----00.0-[03-08]--+-00.0-[04]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | \-00.1 NVIDIA Corporation Device 0fb0 | +-04.0-[05]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | \-00.1 NVIDIA Corporation Device 0fb0 | +-08.0-[06]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | \-00.1 NVIDIA Corporation Device 0fb0 | +-0c.0-[07]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | \-00.1 NVIDIA Corporation Device 0fb0 | \-14.0-[08]----00.0 Mellanox Technologies MT27500 Family [ConnectX-3] +-03.0-[09-12]----00.0-[0a-12]--+-08.0-[0b-11]----00.0-[0c-11]--+-00.0-[0d]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | | \-00.1 NVIDIA Corporation Device 0fb0 | | +-04.0-[0e]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | | \-00.1 NVIDIA Corporation Device 0fb0 | | +-08.0-[0f]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | | \-00.1
NVIDIA Corporation Device 0fb0 | | +-0c.0-[10]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | | \-00.1 NVIDIA Corporation Device 0fb0 | | \-14.0-[11]--+-00.0 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | +-00.1 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | +-00.2 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | +-00.3 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | +-00.4 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | +-00.5 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | +-00.6 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | \-00.7 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | \-10.0-[12]-- +-05.0 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Address Map, VTd_Misc, System Management +-05.1 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Hot Plug +-05.2 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 RAS, Control Status and Global Errors # showing which ones are in failed state :# lspci -vnnn | grep NVIDIA 04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff) 04:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev ff) (prog-if ff) 05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff) 05:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev ff) (prog-if ff) 06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller]) Subsystem: NVIDIA Corporation Device [10de:1132] 06:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev a1) Subsystem: NVIDIA Corporation Device [10de:1132] 07:00.0 VGA compatible controller [0300]: NVIDIA 
Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller]) Subsystem: NVIDIA Corporation Device [10de:1132] 07:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev a1) Subsystem: NVIDIA Corporation Device [10de:1132] 0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller]) Subsystem: NVIDIA Corporation Device [10de:1132] 0d:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev a1) Subsystem: NVIDIA Corporation Device [10de:1132] 0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller]) Subsystem: NVIDIA Corporation Device [10de:1132] 0e:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev a1) Subsystem: NVIDIA Corporation Device [10de:1132] 0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller]) Subsystem: NVIDIA Corporation Device [10de:1132] 0f:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev a1) Subsystem: NVIDIA Corporation Device [10de:1132] 10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller]) Subsystem: NVIDIA Corporation Device [10de:1132] 10:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev a1) Subsystem: NVIDIA Corporation Device [10de:1132] #showing parent bridge of a device that has a failed #:lspci -vvvs 03:00 03:00.0 PCI bridge: PLX Technology, Inc. 
Device 8796 (rev ab) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 32 bytes Bus: primary=03, secondary=04, subordinate=04, sec-latency=0 I/O behind bridge: 00009000-00009fff Memory behind bridge: c5000000-c60fffff Prefetchable memory behind bridge: 000038ffe0000000-000038fff1ffffff Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR- BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+ Address: 00000000fee003b8 Data: 0000 Masking: 000000ff Pending: 00000000 Capabilities: [68] Express (v2) Downstream Port (Slot+), MSI 00 DevCap: MaxPayload 2048 bytes, PhantFunc 0 ExtTag- RBE+ DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+ RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 128 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <4us, L1 <8us ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt- SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise- Slot #32, PowerLimit 75.000W; Interlock- NoCompl- SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg- Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock- SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock- Changed: MRL- PresDet+ 
LinkState- DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Via message ARIFwd+ DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest- Capabilities: [a4] Subsystem: PLX Technology, Inc. Device 3577 Capabilities: [100 v1] Device Serial Number ab-87-00-10-b5-df-0e-00 Capabilities: [fb4 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 1f, GenCap+ CGenEn- ChkCap+ ChkEn- Capabilities: [138 v1] Power Budgeting <?> Capabilities: [10c v1] #19 Capabilities: [148 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=8 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=03 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64+ WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=WRR64 TC/VC=01 Status: NegoPending+ InProgress- Port Arbitration Table <?> Capabilities: [e00 v1] #12 Capabilities: [f24 v1] Access Control Services ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+ ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans- Capabilities: [b70 v1] Vendor Specific Information: ID=0001 Rev=0 Len=010 <?> Kernel driver in use: pcieport 
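In the 03:00 bridge output above, LnkCap advertises Speed 8GT/s, Width x16 while LnkSta reports Speed 2.5GT/s, Width x0. A minimal sketch for pulling just those two fields out of `lspci -vvv` output follows. The function names `link_summary` and `link_is_down` are hypothetical, and treating a negotiated width of x0 as "link down" is an assumption rather than something confirmed in this thread.

```shell
# Sketch only: helper names are made up for illustration.

# Reads `lspci -vvv` output for one device on stdin and prints one line
# each for LnkCap and LnkSta in the form "<field> <speed> <width>".
link_summary() {
    sed -nE 's/^[[:space:]]*(LnkCap|LnkSta):.*Speed ([^ ,]+)[^,]*, Width (x[0-9]+).*/\1 \2 \3/p'
}

# Exit status 0 when LnkSta reports Width x0 (assumed here to mean the
# link did not train), non-zero otherwise.
link_is_down() {
    link_summary | grep -Eq '^LnkSta .* x0$'
}
```

Possible usage, run as root so the capability fields are readable: `lspci -vvvs 03:00 | link_summary`, or `lspci -vvvs 03:00 | link_is_down && echo "link down"`.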
#showing secondary device of 03:00 (parent) which is in failed state #: lspci -vvvs 04:00.0 04:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev ff) (prog-if ff) !!! Unknown header type 7f Kernel driver in use: pci-stub #showing secondary device of 03:00 (parent) of .1 device (audio adapter) that is in failed state #: lspci -vvvs 04:00.1 04:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev ff) (prog-if ff) !!! Unknown header type 7f Kernel driver in use: pci-stub #showing parent device that has a NON failed device #: lspci -vvvs 03:08 03:08.0 PCI bridge: PLX Technology, Inc. Device 8796 (rev ab) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 32 bytes Bus: primary=03, secondary=06, subordinate=06, sec-latency=0 I/O behind bridge: 00007000-00007fff Memory behind bridge: c1000000-c20fffff Prefetchable memory behind bridge: 000038ffa0000000-000038ffb1ffffff Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR- BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+ Address: 00000000fee003f8 Data: 0000 Masking: 000000ff Pending: 00000000 Capabilities: [68] Express (v2) Downstream Port (Slot+), MSI 00 DevCap: MaxPayload 2048 bytes, PhantFunc 0 ExtTag- RBE+ DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+ RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 128 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend- LnkCap: Port #8, 
Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <4us, L1 <8us ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt- SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise- Slot #32, PowerLimit 75.000W; Interlock- NoCompl- SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg- Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock- SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock- Changed: MRL- PresDet- LinkState- DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Via message ARIFwd+ DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest- Capabilities: [a4] Subsystem: PLX Technology, Inc. 
Device 3577 Capabilities: [100 v1] Device Serial Number ab-87-00-10-b5-df-0e-00 Capabilities: [fb4 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 1f, GenCap+ CGenEn- ChkCap+ ChkEn- Capabilities: [138 v1] Power Budgeting <?> Capabilities: [10c v1] #19 Capabilities: [148 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=8 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=03 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64+ WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=WRR64 TC/VC=01 Status: NegoPending- InProgress- Port Arbitration Table <?> Capabilities: [e00 v1] #12 Capabilities: [f24 v1] Access Control Services ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+ ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans- Capabilities: [b70 v1] Vendor Specific Information: ID=0001 Rev=0 Len=010 <?> Kernel driver in use: pcieport #showing secondary device of 03:08 which is NON failed state #: lspci -vvvs 06:00.0 06:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1) (prog-if 00 [VGA controller]) Subsystem: NVIDIA Corporation Device 1132 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 5 Region 0: Memory at c1000000 (32-bit, non-prefetchable) [disabled] [size=16M] Region 1: Memory at 
38ffa0000000 (64-bit, prefetchable) [disabled] [size=256M] Region 3: Memory at 38ffb0000000 (64-bit, prefetchable) [disabled] [size=32M] Region 5: I/O ports at 7000 [disabled] [size=128] Expansion ROM at c2000000 [disabled] [size=512K] Capabilities: [60] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend- LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us ClockPM+ Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest- Capabilities: [100 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=Fixed 
TC/VC=01 Status: NegoPending- InProgress- Capabilities: [250 v1] Latency Tolerance Reporting Max snoop latency: 0ns Max no snoop latency: 0ns Capabilities: [258 v1] L1 PM Substates L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=255us PortTPowerOnTime=10us Capabilities: [128 v1] Power Budgeting <?> Capabilities: [420 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn- Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?> Capabilities: [900 v1] #19 Kernel driver in use: pci-stub #showing secondary device of 03:08 of .1 device (audio adapter) that is in NON failed state #: lspci -vvvs 06:00.1 06:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1) Subsystem: NVIDIA Corporation Device 1132 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 32 bytes Interrupt: pin B routed to IRQ 3 Region 0: Memory at c2080000 (32-bit, non-prefetchable) [size=16K] Capabilities: [60] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [78] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us ExtTag+ AttnBtn- 
AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend- LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us ClockPM+ Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest- Capabilities: [100 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn- Kernel driver in use: pci-stub

On Tue, Oct 18, 2016 at 6:03 PM, Alex Williamson <alex.william...@redhat.com> wrote:

> On Tue, 18 Oct 2016 17:48:59 -0500
> Kevin Vasko <kva...@gmail.com> wrote:
>
> > Alex,
> >
> > I think I was able to do it successfully and was able to make the
> > thing fail. It went from (rev a1) to (rev ff) with response of the
> > header error.
> >
> > Instead of doing all devices I just did 1 at a time.
> >
> > this was the output of
> >
> > # lspci -tv
> >
> > +-02.0-[02-08]----00.0-[03-08]--+-00.0-[04]--+--00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > | \-00.1 NVIDIA Corporation Device 0fb0
> > +-04.0-[05]--+--00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > | \-00.1 NVIDIA Corporation Device 0fb0
> > +-08.0-[06]--+--00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > | \-00.1 NVIDIA Corporation Device 0fb0
> > +-0c.0-[07]--+--00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > | \-00.1 NVIDIA Corporation Device 0fb0
> > +-14.0-[08]----00.0 Mellanox Technologies MT27500 Family [ConnectX-3]
> > +-03.0-[09-12]----00.0-[0a-12]--+-08.0-[0b-11]----00.0-[0c-11]--+--00.0-[0d]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > | \-00.1 NVIDIA Corporation Device 0fb0
> > +--04.0-[0e]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > | \-00.1 NVIDIA Corporation Device 0fb0
> > +--08.0-[0f]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > | \-00.1 NVIDIA Corporation Device 0fb0
> > +--0c.0-[10]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > | \-00.1 NVIDIA Corporation Device 0fb0
> >
> > I tried the first device
> > # virsh nodedev-detach --driver=kvm pci_0000_04_00_0
> > Device pci_0000_04_00_0 detached
> >
> > # virsh nodedev-detach --driver=kvm pci_0000_04_00_1
> > Device pci_0000_04_00_1 detached
> >
> > In the script I put
> >
> > DEVS=(
> > 03:00.0
> > 04
> > )
> >
> > Ran it 100 times and got no error.
> >
> > Ran it for a different device 05
> >
> > # virsh nodedev-detach --driver=kvm pci_0000_05_00_0
> > Device pci_0000_05_00_0 detached
> >
> > # virsh nodedev-detach --driver=kvm pci_0000_05_00_1
> > Device pci_0000_05_00_1 detached
> >
> > DEVS=(
> > 03:04.0
> > 05:
> > )
> >
> > I saw this.
> >
> > #: for i in $(seq 1 100); do ./reset.sh; done
> > 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
> > 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
> > 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
> > 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
> > 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev ff)
> > 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev ff)
> >
> > I repeated this with another device on the system.
> >
> > I assume this indicates that the device is not resetting properly?
> > The question is where do I go from here? Would this indicate a
> > problem with the PCI reset code or problematic hardware?
>
> Right, the PCIe link is not coming back for some reason, that seems
> like a hardware issue. Can you attach the output of 'sudo lspci -vvvs
> 3:04.0' when you're in this state (replace with the appropriate parent
> bridge depending on the failed device), maybe we can see if that
> downstream port is stuck in training.
>
> What I would do next is to test each card repeatedly. Do only some
> cards fail? If so, swap a working card and a non-working card, does
> the failure follow the card or the slot? I'm not sure what the result
> is going to be, but if we can't rely on a PCI bus reset then you're
> really not going to have any repeatability with assigning the GPUs.
> Thanks,
>
> Alex
>
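Testing each card repeatedly, as suggested above, is easier with a mechanical way to spot a failed function. In the listings in this thread a failed function shows up as (rev ff), so a minimal sketch is to filter lspci output on that marker. The helper names `failed_functions` and `any_failed` are hypothetical, and treating (rev ff) as the only failure signature is an assumption based solely on the output shown here.

```shell
# Sketch only: helper names are made up for illustration.

# Reads plain `lspci` output on stdin and prints the bus address
# (first column) of every function whose line contains "(rev ff)".
failed_functions() {
    awk '/\(rev ff\)/ { print $1 }'
}

# Exit status 0 when at least one function on stdin reports "(rev ff)".
any_failed() {
    [ -n "$(failed_functions)" ]
}
```

Possible usage after each reset pass: `lspci -s 05: | failed_functions`, or `lspci | any_failed && echo "at least one function is in the failed state"`.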
_______________________________________________ vfio-users mailing list vfio-users@redhat.com https://www.redhat.com/mailman/listinfo/vfio-users