Re: [vfio-users] Bus reset trouble with Titan-X
Thanks for the information. I didn't even notice the Presence Detect Changed bit difference (granted, that is mostly due to not knowing what to look for, and this being a little over my head).

I wouldn't expect a different card to make a difference, but at this point I'm out of things to try on my end. As for trying a non-NVIDIA card, we don't have any available that I'm aware of, so unfortunately I won't be able to test that.

I'm not very familiar with the PLX technology, and definitely not sure what the manufacturer might have done with this particular board (e.g. whether this is a problem with the firmware on the chip, a problem they introduced with their implementation, or just a bad board). (Just thinking out loud.)

No matter; I think I now have enough information to give to the manufacturer, and they should be able to diagnose the problem from here. I'll report back with what they suggest for a resolution.

Thanks again for your help, I really appreciate it. I'm not sure if supporting people on this mailing list is part of your daily job, but if it would help you out, send me an email directly with your manager's name and I would be more than happy to send them some feedback.

Thanks again,

-Kevin

On Wed, Oct 19, 2016 at 12:44 PM, Alex Williamson <alex.william...@redhat.com> wrote:
> On Wed, 19 Oct 2016 12:16:30 -0500
> Kevin Vasko wrote:
>
> > Ah, ok. My bad.
> > > > > > Ran > > > > #: setpci -s 3:00.0 82.w=8:8 > > > > SltSta: Status: AttnBtn- PowerFit- MRL- CmdCplt- PresDet+ Interlock- > > Changed: MRL- PresDet- LinkState- > > > > #: setpci -s 3:00.0 78.w=20:20 > > > > StlSta: Status AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock- > >Changed: MRL- PresDet- LinkState- > > > > > > When I run lspci -vvs 3:00.0 it is currently in this state > > > > StlSta: Status AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock- > >Changed: MRL- PresDet- LinkState- > > > > I didn't realize that I was needing to look at "PresDet", sorry. It does > > look like it is different than before so I assume the setpci commands > > changed it somewhere. > > > > The device (GPU) is still in the "(rev ff) (prog-if ff)" state. > > Ok, it would have been a long shot, the Presence Detect Changed bit > really should not have been having any effect on re-establishing the > link, it was just a notable difference between the working and > non-working examples. > > > Do you think this could be a GPU issue? I have not tried a different GPU > in > > the system. Would it be worthwhile trying an NVidia M4000 to see if I get > > the same results or do you think there is a problem with the PLX Riser? > > I can only speculate here, but I wouldn't expect PCIe link > characteristics to be significantly different between consumer and > workstation class cards. If you have one on hand, it certainly doesn't > hurt to try though. Perhaps performing the same test with a non-NVIDIA > card installed might be more enlightening, preferably a card with > similar PCIe width and speed, but any sort of data point might be > useful. > > I will note that NVIDIA does make use of PLX PCIe switches on some of > their devices, both the GRID K1 and Tesla M60 (probably others as well) > make use of a PLX PEX 8747 switch to pack multiple GPUs onto a single > card. So there might be a reasonable expectation of PLX switches > working with NVIDIA devices. 
What sort of tuning or special > configuration NVIDIA does on those since the switch is onboard the > card, I have no idea. Thanks, > > Alex > ___ vfio-users mailing list vfio-users@redhat.com https://www.redhat.com/mailman/listinfo/vfio-users
Re: [vfio-users] Bus reset trouble with Titan-X
On Wed, 19 Oct 2016 12:16:30 -0500
Kevin Vasko wrote:

> Ah, ok. My bad.
>
> Ran
>
> #: setpci -s 3:00.0 82.w=8:8
>
> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>         Changed: MRL- PresDet- LinkState-
>
> #: setpci -s 3:00.0 78.w=20:20
>
> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>         Changed: MRL- PresDet- LinkState-
>
> When I run lspci -vvs 3:00.0 it is currently in this state
>
> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>         Changed: MRL- PresDet- LinkState-
>
> I didn't realize that I was needing to look at "PresDet", sorry. It does
> look like it is different than before so I assume the setpci commands
> changed it somewhere.
>
> The device (GPU) is still in the "(rev ff) (prog-if ff)" state.

Ok, it would have been a long shot. The Presence Detect Changed bit really should not have had any effect on re-establishing the link; it was just a notable difference between the working and non-working examples.

> Do you think this could be a GPU issue? I have not tried a different GPU in
> the system. Would it be worthwhile trying an NVidia M4000 to see if I get
> the same results or do you think there is a problem with the PLX Riser?

I can only speculate here, but I wouldn't expect PCIe link characteristics to be significantly different between consumer and workstation class cards. If you have one on hand, it certainly doesn't hurt to try though. Perhaps performing the same test with a non-NVIDIA card installed might be more enlightening, preferably a card with similar PCIe width and speed, but any sort of data point might be useful.

I will note that NVIDIA does make use of PLX PCIe switches on some of their devices; both the GRID K1 and Tesla M60 (probably others as well) make use of a PLX PEX 8747 switch to pack multiple GPUs onto a single card. So there might be a reasonable expectation of PLX switches working with NVIDIA devices. What sort of tuning or special configuration NVIDIA does on those, since the switch is onboard the card, I have no idea. Thanks,

Alex
Re: [vfio-users] Bus reset trouble with Titan-X
Ah, ok. My bad.

Ran

#: setpci -s 3:00.0 82.w=8:8

SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
        Changed: MRL- PresDet- LinkState-

#: setpci -s 3:00.0 78.w=20:20

SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
        Changed: MRL- PresDet- LinkState-

When I run lspci -vvs 3:00.0 it is currently in this state:

SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
        Changed: MRL- PresDet- LinkState-

I didn't realize that I needed to look at "PresDet", sorry. It does look like it is different than before, so I assume the setpci commands changed it somewhere.

The device (GPU) is still in the "(rev ff) (prog-if ff)" state.

Do you think this could be a GPU issue? I have not tried a different GPU in the system. Would it be worthwhile trying an NVIDIA M4000 to see if I get the same results, or do you think there is a problem with the PLX Riser?

Thanks,

-Kevin

On Wed, Oct 19, 2016 at 11:57 AM, Alex Williamson <alex.william...@redhat.com> wrote:
> On Wed, 19 Oct 2016 11:46:21 -0500
> Kevin Vasko wrote:
>
> > Alex,
> >
> > Thanks, but no luck.
> >
> > I ran :
> >
> > #:setpci -s 3:00.0 82.w=8:8
> >
> > checked
> >
> > #:lspci -vvvs 3:00.0
> >
> > MRL- was the same.
>
> PresDet+ on the Changed: line was the thing we were looking for. MRL
> is a retention latch specifically for hotplug capable slots.
>
> > #: setpci -s 3:00.0 78.w=20:20
> >
> > checked:
> >
> > #: lspci -vvs 3:00.0
> >
> > MRL- was the same
> >
> > LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
> >
> > SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
> >         Changed: MRL- PresDet+ LinkState-
> >
> > Just for my own knowledge what does "retrain" mean? I assume resetting the
> > bus and it reconnecting successfully?
>
> Retraining triggers a re-sync of the link width and speed parameters,
> think of it like kicking an Ethernet connection to renegotiate
> 10/100/1000Mbps speeds, a similar thing happens between a device and
> the downstream port that it's connected to to determine the link
> parameters. Thanks,
>
> Alex
Re: [vfio-users] Bus reset trouble with Titan-X
On Wed, 19 Oct 2016 11:46:21 -0500
Kevin Vasko wrote:

> Alex,
>
> Thanks, but no luck.
>
> I ran :
>
> #:setpci -s 3:00.0 82.w=8:8
>
> checked
>
> #:lspci -vvvs 3:00.0
>
> MRL- was the same.

PresDet+ on the Changed: line was the thing we were looking for. MRL is a retention latch specifically for hotplug capable slots.

> #: setpci -s 3:00.0 78.w=20:20
>
> checked:
>
> #: lspci -vvs 3:00.0
>
> MRL- was the same
>
> LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
>
> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>         Changed: MRL- PresDet+ LinkState-
>
> Just for my own knowledge what does "retrain" mean? I assume resetting the
> bus and it reconnecting successfully?

Retraining triggers a re-sync of the link width and speed parameters. Think of it like kicking an Ethernet connection to renegotiate 10/100/1000Mbps speeds; a similar thing happens between a device and the downstream port that it's connected to, to determine the link parameters. Thanks,

Alex
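The two things being checked in this exchange (the PresDet flag on the Changed: line, and the negotiated link width in LnkSta) can be pulled out of lspci output with a little text processing. What follows is only a minimal sketch: the helper names `presdet_changed` and `link_width` are made up for illustration and are not part of pciutils, and they assume the lspci -vv text layout shown in this thread.

```shell
#!/bin/sh
# Hypothetical helpers (not part of pciutils). Each takes lspci -vv text
# as its first argument and parses it with sed.

# Print "+" or "-" for the PresDet flag that follows "Changed:" in SltSta.
presdet_changed() {
    printf '%s\n' "$1" | sed -n 's/.*Changed:.*PresDet\([+-]\).*/\1/p'
}

# Print the negotiated link width from the LnkSta line.
# A width of 0 means the link did not come up.
link_width() {
    printf '%s\n' "$1" | sed -n 's/.*LnkSta:.*Width x\([0-9][0-9]*\).*/\1/p'
}
```

On a live system the argument would typically be the output of lspci itself, for example `presdet_changed "$(lspci -vvs 3:00.0)"`, run as root so that the capability registers are readable.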
Re: [vfio-users] Bus reset trouble with Titan-X
Alex, Thanks, but no luck. I ran : #:setpci -s 3:00.0 82.w=8:8 checked #:lspci -vvvs 3:00.0 MRL- was the same. #: setpci -s 3:00.0 78.w=20:20 checked: #: lspci -vvs 3:00.0 MRL- was the same LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt- SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock- Changed: MRL- PresDet+ LinkState- Just for my own knowledge what does "retrain" mean? I assume resetting the bus and it reconnecting successfully? Thanks again, -Kevin On Wed, Oct 19, 2016 at 10:50 AM, Alex Williamson < alex.william...@redhat.com> wrote: > On Wed, 19 Oct 2016 10:00:57 -0500 > Kevin Vasko wrote: > > > Sure thing. I'm attaching all of the logs I have to let you get a bigger > > picture (and anyone that might run into a similar issue). Hopefully I > > didn't mess anything up. > > > ... > > Here's the bit I was curious about: > > > #showing parent bridge of a device that has a failed > > #:lspci -vvvs 03:00 > > 03:00.0 PCI bridge: PLX Technology, Inc. Device 8796 (rev ab) (prog-if 00 > > [Normal decode]) > ... > > LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency > > L0s <4us, L1 <8us > > ClockPM- Surprise- LLActRep- BwNot- > > LnkCtl: ASPM Disabled; Disabled- CommClk- > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- > > LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- > > ABWMgmt- > > > The Link Status shows that it's in Gen1 mode at x0 width, so the link > failed to return to a working state after bus reset. Maybe a hint is > that the Slot Status register shows that the Presence Detect Changed bit > got flipped, but the Presence Detect State bit remains 1, indicating > that a card is present. However Presence Detect Changed Enable is not > set in the Slot Control register, so the OS doesn't get notified about > this. > > I wonder what would happen if we cleared the Presence Detect Changed > bit and tried to retrain the link. 
The express capability is at 0x68, > the slot status register is at 0x1a, bit 3 is the presence detect > changed bit and it's RW1C (read, write 1 to clear). Therefore to clear > the bit we could do: > > setpci -s 3:00.0 82.w=8:8 > > Recheck with lspci -vvvs 3:00.0 to check whether > > SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock- > Changed: MRL- PresDet+ LinkState- > > > Still reports + or - and possible if the link has decided to retrain. > To force a retrain we need to poke bit 5 in the link control register, > offset 0x10: > > setpci -s 3:00.0 78.w=20:20 > > Recheck lspci to see if there's any progress. > > ... > > #showing parent device that has a NON failed device > > #: lspci -vvvs 03:08 > > 03:08.0 PCI bridge: PLX Technology, Inc. Device 8796 (rev ab) (prog-if 00 > > [Normal decode]) > ... > > LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency > > L0s <4us, L1 <8us > > ClockPM- Surprise- LLActRep- BwNot- > > LnkCtl: ASPM Disabled; Disabled- CommClk- > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- > > LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk- DLActive- BWMgmt- > > ABWMgmt- > > In this case the link has retrained to Gen3 x16 and of course the > downstream devices are accessible. The Presence Detect Changed bit is > set to - on this port. Thanks, > > Alex > ___ vfio-users mailing list vfio-users@redhat.com https://www.redhat.com/mailman/listinfo/vfio-users
Re: [vfio-users] Bus reset trouble with Titan-X
On Wed, 19 Oct 2016 10:00:57 -0500
Kevin Vasko wrote:

> Sure thing. I'm attaching all of the logs I have to let you get a bigger
> picture (and anyone that might run into a similar issue). Hopefully I
> didn't mess anything up.
>
...

Here's the bit I was curious about:

> #showing parent bridge of a device that has failed
> #:lspci -vvvs 03:00
> 03:00.0 PCI bridge: PLX Technology, Inc. Device 8796 (rev ab) (prog-if 00 [Normal decode])
...
> LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <4us, L1 <8us
>         ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; Disabled- CommClk-
>         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-

The Link Status shows that it's in Gen1 mode at x0 width, so the link failed to return to a working state after bus reset. Maybe a hint is that the Slot Status register shows that the Presence Detect Changed bit got flipped, but the Presence Detect State bit remains 1, indicating that a card is present. However, Presence Detect Changed Enable is not set in the Slot Control register, so the OS doesn't get notified about this.

I wonder what would happen if we cleared the Presence Detect Changed bit and tried to retrain the link. The express capability is at 0x68 and the slot status register is at offset 0x1a within it (0x68 + 0x1a = 0x82). Bit 3 (mask 0x8) is the presence detect changed bit, and it's RW1C (read, write 1 to clear). Therefore to clear the bit we could do:

setpci -s 3:00.0 82.w=8:8

Recheck with lspci -vvvs 3:00.0 to see whether

SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
        Changed: MRL- PresDet+ LinkState-

still reports + or -, and possibly whether the link has decided to retrain. To force a retrain we need to poke bit 5 (mask 0x20) in the link control register, offset 0x10 (0x68 + 0x10 = 0x78):

setpci -s 3:00.0 78.w=20:20

Recheck lspci to see if there's any progress.

...
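The register arithmetic behind those two setpci commands can be checked with a few lines of shell. This is a small sketch rather than anything from pciutils: the variable and function names are invented for this example, and the 0x68 capability base applies to this particular PLX port only (on another device it would have to be read from lspci first).

```shell
#!/bin/sh
# Offsets as quoted in the message above. The names are illustrative only.
EXP_CAP=0x68     # PCI Express capability base on this PLX downstream port
SLTSTA_OFF=0x1a  # Slot Status register, relative to the capability
LNKCTL_OFF=0x10  # Link Control register, relative to the capability

# Print the config-space address of a register as bare hex.
reg_addr() { printf '%x\n' $(( $1 + $2 )); }

# Print the mask for a single bit as bare hex.
bit_mask() { printf '%x\n' $(( 1 << $1 )); }

# Build the "ADDR.w=VALUE:MASK" argument that sets one bit with setpci.
# Arguments: capability base, register offset, bit number.
setpci_set_bit_arg() {
    m=$(bit_mask "$3")
    printf '%s.w=%s:%s\n' "$(reg_addr "$1" "$2")" "$m" "$m"
}

# Example (prints the arguments only, touches no hardware):
#   setpci_set_bit_arg "$EXP_CAP" "$SLTSTA_OFF" 3   # Presence Detect Changed
#   setpci_set_bit_arg "$EXP_CAP" "$LNKCTL_OFF" 5   # Retrain Link
```

Reasonably recent versions of setpci also accept capability-relative addressing such as `CAP_EXP+1a.w`, which avoids hard-coding the capability base; check the setpci man page on your system before relying on it.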
> #showing parent device that has a NON failed device
> #: lspci -vvvs 03:08
> 03:08.0 PCI bridge: PLX Technology, Inc. Device 8796 (rev ab) (prog-if 00 [Normal decode])
...
> LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <4us, L1 <8us
>         ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; Disabled- CommClk-
>         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-

In this case the link has retrained to Gen3 x16 and of course the downstream devices are accessible. The Presence Detect Changed bit is set to - on this port. Thanks,

Alex
Re: [vfio-users] Bus reset trouble with Titan-X
Sure thing. I'm attaching all of the logs I have to let you get a bigger picture (and anyone that might run into a similar issue). Hopefully I didn't mess anything up. Unfortunately, I've seen almost every single device fail at one point or another. I was thinking it might be isolated to a single PLX Riser card but I have now seen devices fail on every single parent device at one time or another. Based on that, I don't think I could narrow it down to a single PCISlot/PLX Riser that is the culprit. Unless both of these boards are bad, my conclusion is that this indicates a problem with the hardware as well. I completely agree that if the PCI Bus reset isn't working properly, nothing is going to work. I sent these steps to the manufacturer to see if they could reproduce the issue on their end. If they can then they will need to investigate on their end why the problem exists. If they can't, it is possible we have a bad set of boards in this machine. Thank you so much for your help. Really appreciate it. 
-Kevin #: lspci -tv | +-1f.0 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU | \-1f.2 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU \-[:00]-+-00.0 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2 +-01.0-[01]-- +-02.0-[02-08]00.0-[03-08]--+-00.0-[04]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | \-00.1 NVIDIA Corporation Device 0fb0 | +-04.0-[05]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | \-00.1 NVIDIA Corporation Device 0fb0 | +-08.0-[06]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | \-00.1 NVIDIA Corporation Device 0fb0 | +-0c.0-[07]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | \-00.1 NVIDIA Corporation Device 0fb0 | \-14.0-[08]00.0 Mellanox Technologies MT27500 Family [ConnectX-3] +-03.0-[09-12]00.0-[0a-12]--+-08.0-[0b-11]00.0-[0c-11]--+-00.0-[0d]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | | \-00.1 NVIDIA Corporation Device 0fb0 | | +-04.0-[0e]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | | \-00.1 NVIDIA Corporation Device 0fb0 | | +-08.0-[0f]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | | \-00.1 NVIDIA Corporation Device 0fb0 | | +-0c.0-[10]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X] | | | \-00.1 NVIDIA Corporation Device 0fb0 | | \-14.0-[11]--+-00.0 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | +-00.1 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | +-00.2 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | +-00.3 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | +-00.4 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | +-00.5 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | +-00.6 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function | | \-00.7 Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function 
| \-10.0-[12]-- +-05.0 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Address Map, VTd_Misc, System Management +-05.1 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Hot Plug +-05.2 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 RAS, Control Status and Global Errors # showing which ones are in failed state :# lspci -vnnn | grep NVIDIA 04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff) 04:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev
Re: [vfio-users] Bus reset trouble with Titan-X
On Tue, 18 Oct 2016 17:48:59 -0500 Kevin Vasko wrote: > Alex, > > I think I was able to do it successfully and was scucessfully able to make > the thing fail. It went from (rev a1) to (rev ff) with response of the > header error. > > Instead of doing all devices I just did 1 at a time. > > this was the output of > > # lspci -tv > > +-02.0-[02-08]00.0-[03-08]--+-00.0-[04]--+--00.0 NVIDIA Corporation > GM200 [GeForce GTX TITAN X] > | \-00.1 > NVIDIA Corporation Device efb0 > +-04.0-[05]--+--00.0 NVIDIA > Corporation GM200 [GeForce GTX TITAN X] > | \-00.1 > NVIDIA Corporation Device efb0 > +-08.0-[06]--+--00.0 NVIDIA > Corporation GM200 [GeForce GTX TITAN X] > | \-00.1 > NVIDIA Corporation Device efb0 > +-0c.0-[07]--+--00.0 NVIDIA > Corporation GM200 [GeForce GTX TITAN X] > | \-00.1 > NVIDIA Corporation Device efb0 > +-14.0-[08]00.0 Mellanox > Technologies MT27600 Family [ConnectX-3] > +-03.0-[09-12]00.0-[0a-12]--+-08.0-[0b-11]00.0-[0c-11]--+--00.0-[0d]--+-00.0 > NVIDIA Corporation GM200 [GeForce GTX TITAN X] > > | \-00.1 NVIDIA Corporation Device 0fb0 > > +--04.0-[0e]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN > X] > > | \-00.1 NVIDIA Corporation Device 0fb0 > > +--08.0-[0f]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN > X] > > | \-00.1 NVIDIA Corporation Device 0fb0 > > +--0c.0-[10]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN > X] > > | \-00.1 NVIDIA Corporation Device 0fb0 > > I tried the first device > # virsh nodedev-detach --driver=kvm pci__04_00_0 > Device pci__04_00_0 detached > > # virsh nodedev-detach --driver=kvm pci__04_00_1 > Device pci__04_00_1 detached > > In the script I put > > DEVS=( > 03:00.0 > 04 > ) > > Ran it 100 times and got no error. > > Ran it for a different device 05 > > > > # virsh nodedev-detach --driver=kvm pci__05_00_0 > Device pci__05_00_0 detached > > # virsh nodedev-detach --driver=kvm pci__05_00_1 > Device pci__05_00_1 detached > > DEVS=( > 03:04.0 > 05: > ) > > > I saw this. 
> > #: for i in $(seq 1 100); do ./reset.sh; done > 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX > TITAN X] (rev a1) > 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1) > 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX > TITAN X] (rev a1) > 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1) > 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX > TITAN X] (rev ff) > 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev ff) > > I repeated this with another device on the system. > > I assume this indicates that that the device is not resetting properly? The > question is where do I go from here? Would this indicate a problem with the > PCI Reset code or a problematic hardware? Right, the PCIe link is not coming back for some reason, that seems like a hardware issue. Can you attach the output of 'sudo lspci -vvvs 3:04.0' when you're in this state (replace with the appropriate parent bridge depending on the failed device), maybe we can see if that downstream port is stuck in training. What I would do next is to test each card repeatedly. Do only some cards fail? If so, swap a working card and a non-working card, does the failure follow the card or the slot? I'm not sure what the result is going to be, but if we can't rely on a PCI bus reset then you're really not going to have any repeat-ability with assigning the GPUs. Thanks, Alex ___ vfio-users mailing list vfio-users@redhat.com https://www.redhat.com/mailman/listinfo/vfio-users
Re: [vfio-users] Bus reset trouble with Titan-X
Alex,

I think I was able to do it successfully and was able to make the thing fail. It went from (rev a1) to (rev ff), along with the header error.

Instead of doing all devices I just did 1 at a time.

This was the output of

# lspci -tv

+-02.0-[02-08]----00.0-[03-08]--+-00.0-[04]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               +-04.0-[05]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               +-08.0-[06]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               +-0c.0-[07]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               \-14.0-[08]----00.0  Mellanox Technologies MT27500 Family [ConnectX-3]
+-03.0-[09-12]----00.0-[0a-12]--+-08.0-[0b-11]----00.0-[0c-11]--+-00.0-[0d]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               |                               +-04.0-[0e]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               |                               +-08.0-[0f]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               |                               +-0c.0-[10]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0

I tried the first device

# virsh nodedev-detach --driver=kvm pci_0000_04_00_0
Device pci_0000_04_00_0 detached

# virsh nodedev-detach --driver=kvm pci_0000_04_00_1
Device pci_0000_04_00_1 detached

In the script I put

DEVS=(
03:00.0
04
)

Ran it 100 times and got no error.

Ran it for a different device, 05

# virsh nodedev-detach --driver=kvm pci_0000_05_00_0
Device pci_0000_05_00_0 detached

# virsh nodedev-detach --driver=kvm pci_0000_05_00_1
Device pci_0000_05_00_1 detached

DEVS=(
03:04.0
05:
)

I saw this.
#: for i in $(seq 1 100); do ./reset.sh; done
05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev ff)
05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev ff)

I repeated this with another device on the system.

I assume this indicates that the device is not resetting properly? The question is where do I go from here? Would this indicate a problem with the PCI reset code or problematic hardware?

-Kevin

On Tue, Oct 18, 2016 at 11:49 AM, Alex Williamson <alex.william...@redhat.com> wrote:
> On Tue, 18 Oct 2016 11:04:14 -0500
> Kevin Vasko wrote:
>
> > Alex,
> >
> > (crossing fingers this goes into the correct thread).
> >
> > I upgraded this machine to 4.4.0-42-generic.
> >
> > I spawned a single VM with 1 GPU immediately after the kernel upgrade. It
> > works. It attached properly and in the VM when I ran lspci, it showed up
> > properly.
> >
> > I deleted that VM and started up the system with 4x GPUs, and then it
> > started exhibiting the same issue. Three of the GPUs attached properly.
> >
> > This appears to be that it was not resolved with upgrading the kernel. If
> > you don't mind providing instructions on resetting the bus to see if I can
> > narrow this down further (what you were talking about yesterday) that would
> > be appreciated. Any other suggestions would be greatly appreciated as well.
>
> Ok, you're going to need to identify the parent bridge for the GPUs.
> You can do this with 'lspci -tv'. If you need help, send the output of
> that command.
Here's an example: > > # lspci -tv > -[:00]-+-00.0 Intel Corporation 5520/5500/X58 I/O Hub to ESI Port >+-01.0-[01]--+-00.0 Intel Corporation 82576 Gigabit Network > Connection >|\-00.1 Intel Corporation 82576 Gigabit Network > Connection >+-03.0-[02]00.0 Fresco Logic FL1100 USB 3.0 Host Controller >+-07.0-[03]--+-00.0 Intel Corporation Ethernet Controller X710 > for 10GbE SFP+ >|\-00.1 Intel Corporation Ethernet Controller X710 > for 10GbE SFP+ >... > > Say I want to do a bus reset on the X710 ethernet devices at 03:00.0 > and 03:00.1. This should be similar to a GPU and companion audio
Re: [vfio-users] Bus reset trouble with Titan-X
On Tue, 18 Oct 2016 11:04:14 -0500
Kevin Vasko wrote:

> Alex,
>
> (crossing fingers this goes into the correct thread).
>
> I upgraded this machine to 4.4.0-42-generic.
>
> I spawned a single VM with 1 GPU immediately after the kernel upgrade. It
> works. It attached properly and in the VM when I ran lspci, it showed up
> properly.
>
> I deleted that VM and started up the system with 4x GPUs, and then it
> started exhibiting the same issue. Three of the GPUs attached properly.
>
> This appears to be that it was not resolved with upgrading the kernel. If
> you don't mind providing instructions on resetting the bus to see if I can
> narrow this down further (what you were talking about yesterday) that would
> be appreciated. Any other suggestions would be greatly appreciated as well.

Ok, you're going to need to identify the parent bridge for the GPUs. You can do this with 'lspci -tv'. If you need help, send the output of that command. Here's an example:

# lspci -tv
-[0000:00]-+-00.0  Intel Corporation 5520/5500/X58 I/O Hub to ESI Port
           +-01.0-[01]--+-00.0  Intel Corporation 82576 Gigabit Network Connection
           |            \-00.1  Intel Corporation 82576 Gigabit Network Connection
           +-03.0-[02]----00.0  Fresco Logic FL1100 USB 3.0 Host Controller
           +-07.0-[03]--+-00.0  Intel Corporation Ethernet Controller X710 for 10GbE SFP+
           |            \-00.1  Intel Corporation Ethernet Controller X710 for 10GbE SFP+
           ...

Say I want to do a bus reset on the X710 ethernet devices at 03:00.0 and 03:00.1. This should be similar to a GPU and companion audio device. The parent bridge is device 00:07.0. I can double check this by running lspci on this device:

# lspci -vs 00:07.0
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 27
        Bus: primary=00, secondary=03, subordinate=03, sec-latency=0

The secondary bus is 03, thus it's the parent device of 03:00.[01].
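The double check described above (does the bridge's secondary bus match the bus of the devices to be reset?) is easy to script. The sketch below is an illustration under assumptions: `secondary_bus` and `is_parent_of_bus` are hypothetical helper names, and they only parse the "Bus:" line in the format lspci -v prints in this message.

```shell
#!/bin/sh
# Hypothetical helpers (not part of pciutils). Each takes the lspci -v
# output of a bridge as its first argument.

# Print the two-digit secondary bus number from the "Bus:" line.
secondary_bus() {
    printf '%s\n' "$1" |
        sed -n 's/.*secondary=\([0-9a-fA-F][0-9a-fA-F]\).*/\1/p'
}

# Succeed if the bridge described by $1 is the direct parent of bus $2.
is_parent_of_bus() {
    [ "$(secondary_bus "$1")" = "$2" ]
}
```

For the example system in this message, `is_parent_of_bus "$(lspci -vs 00:07.0)" 03` would be expected to succeed, confirming 00:07.0 as the bridge to reset for the devices on bus 03.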
Prior to performing a bus reset, attach all the affected devices to a driver that isn't going to be making use of the devices, for instance pci-stub. We can do this with virsh using:

# virsh nodedev-detach --driver=kvm pci_0000_03_00_0
Device pci_0000_03_00_0 detached

# virsh nodedev-detach --driver=kvm pci_0000_03_00_1
Device pci_0000_03_00_1 detached

The "--driver=kvm" simply selects pci-stub rather than vfio-pci, which would otherwise be the default. Also note that after a bus reset, the downstream devices are not going to be usable until after a system reboot. Our goal is to see how reliably we can perform a bus reset and have the devices re-appear; we cannot make use of them beyond running lspci on them without a system reboot.

Ok, so for each GPU you should know the parent bridge, the address of the GPUs themselves, and each GPU and companion audio device should be bound to pci-stub. Using the dual port NICs as stand-ins for your GPUs, we need a script like this:

# cat reset.sh
#!/bin/bash
# bash rather than plain sh: the DEVS array below is not POSIX sh

DEVS=(
00:01.0 # parent of 01:
01:     # affected devices of 01.0
00:07.0 # parent of 03:
03:     # affected devices of 07.0
# change the entries above for your system
# you will have more devices here, a parent bridge
# followed by the bus of the affected GPU, 0f:, 10:, 0e:, 0d:
)

i=0
while [ $i -lt ${#DEVS[@]} ]; do
    setpci -s ${DEVS[$i]} 3e.w=40:40 # Set 2ndary bus reset bit
    sleep 0.2
    setpci -s ${DEVS[$i]} 3e.w=0:40 # Clear 2ndary bus reset bit
    sleep 1
    # when this reports abnormally, we've failed
    lspci -s ${DEVS[$(( $i + 1 ))]}
    i=$(( $i + 2 ))
done

Don't forget to chmod 755 the script.
Run it once and it should produce something like this (of course with your GPUs instead of my NICs):

# ./reset.sh
01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
03:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)

If that works, then run it 100 times:

# for i in $(seq 1 100); do ./reset.sh; done

If you start seeing "(rev ff) (prog-if ff)" then the device has failed. (left as an exercise to the reader to automatically stop on this condition ;)

Please report what you find and remember that it's expected that you will need to reboot the system after performing this test to get the devices back into a workable state. We're not saving and restoring the state of the devices around reset. Thanks,

Alex
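One possible take on the exercise mentioned above, stopping automatically at the first "(rev ff)", is sketched here. It is not from the thread: `has_failed` and `run_until_failure` are hypothetical names, and the sketch assumes the reset script prints its lspci lines on standard output as reset.sh does.

```shell
#!/bin/sh
# Hypothetical helpers for stopping a reset loop at the first failure.

# Succeed if the given text contains the "(rev ff)" failure signature.
has_failed() {
    case "$1" in
        *"(rev ff)"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Run a command up to N times, echoing its output each time, and stop
# with a non-zero status as soon as the output shows a failed device.
# Arguments: maximum number of runs, then the command and its arguments.
run_until_failure() {
    max=$1
    shift
    n=1
    while [ "$n" -le "$max" ]; do
        out=$("$@")
        printf '%s\n' "$out"
        if has_failed "$out"; then
            echo "device failed on iteration $n" >&2
            return 1
        fi
        n=$((n + 1))
    done
    return 0
}
```

With these functions defined in the current shell, `run_until_failure 100 ./reset.sh` would replace the plain `for` loop and report the iteration on which the device dropped off the bus.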
Re: [vfio-users] Bus reset trouble with Titan-X (was Re: Welcome to the "vfio-users" mailing list (Digest mode))
Thanks. I'm an idiot. I just replied to the email directly after the subscription and wasn't paying attention. Thank you for correcting it.

I was originally running 3.13.0-86-generic and upgraded to the 3.19 version to try before I posted this, but got the same results. I'll try a newer version of the kernel and see what happens.

Sorry to be dense, but what do you mean by "retrain properly"? I assume you mean that once it fails to reset it just never recovers?

We have 2 other machines that I've never seen this problem with, so what you are saying makes sense. This system does have a slightly more specialized PCI bus to be able to stick 8 cards on a single bus (at least that is my understanding), so at this point, either I'm hitting a bug that is fixed in a newer kernel, or this PCI bus is not doing something that vfio-pci is expecting (would be my speculation).

I'll report back my findings tomorrow. Thanks for the help.

-Kevin

On Mon, Oct 17, 2016 at 5:53 PM, Alex Williamson wrote:
> (generally a good idea to have a useful subject line)
>
> On Mon, 17 Oct 2016 16:26:15 -0500
> Kevin Vasko wrote:
>
> > Any suggestions on debugging a !!! Unknown header type 7f?
> >
>
> This usually means that the device didn't come back from bus reset and
> re-reading the PCI config space where the device was just gives a -1
> response. lspci tries to interpret that bogus data and gives results
> like you see. You might try a newer kernel, we've probably fixed some
> things in the bus reset path since v3.19. It looks like you continue
> to see the bogus data once it gets into this state, so it's probably
> not a "simple" device coming out of reset too slowly problem. Possibly
> the PCIe link doesn't retrain properly sometimes after a bus reset. If
> a new kernel doesn't help, I could give you instructions for performing
> a bus reset with setpci and you could test how reliably you can reset
> the device and read config space after. Thanks,
>
> Alex
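To make the "-1 response" concrete: when a device does not answer, every config-space byte reads back as 0xff, which is where "(rev ff) (prog-if ff)" comes from. As far as I understand lspci, it masks off bit 7 (the multi-function flag) of the header type byte before decoding it, so 0xff turns into the "7f" in the error message. A tiny sketch of that arithmetic, with `header_type` being a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: show the header type lspci would report for a raw
# header-type byte (config offset 0x0e). Bit 7 is the multi-function flag,
# so only the low seven bits select the header layout.
header_type() {
    printf '%02x\n' $(( $1 & 0x7f ))
}

# Example: header_type 0xff   (all-ones read from a missing device)
```

Valid layouts are 00 (endpoint), 01 (PCI-to-PCI bridge) and 02 (CardBus bridge), so a value of 7f is a strong sign that the read returned all ones rather than real data.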