Re: [vfio-users] Bus reset trouble with Titan-X

2016-10-19 Thread Kevin Vasko
Thanks for the information. I didn't even notice the Presence Detect
Changed bit difference (granted, that is mostly due to not knowing what to
look for and being a little over my head at this point).

I wouldn't figure that there would be a difference in using a different
card but at this point I'm out of things to try on my end. As for trying a
non-NVIDIA card, we don't have any available that I'm aware of so wouldn't
be able to test that unfortunately.

I'm not very familiar with the PLX technology, and definitely not sure what
the manufacturer might have done with this particular board (e.g. if this
is a problem with the firmware on the chip, or they introduced a problem
with their implementation or, if just the board is bad). (just talking
out-loud)

But no matter, I think I now have enough information to give to the
manufacturer, and they should be able to diagnose the problem from here.

I'll report back with what they suggest for a resolution.

Thanks again for your help, I really appreciate it. I'm not sure if
supporting people on this mailing list is part of your daily job, but if it
would help you out, send me an email directly with your manager's name and I
would be more than happy to send them some feedback.

Thanks again,

-Kevin




On Wed, Oct 19, 2016 at 12:44 PM, Alex Williamson <
alex.william...@redhat.com> wrote:

> On Wed, 19 Oct 2016 12:16:30 -0500
> Kevin Vasko  wrote:
>
> > Ah, ok. My bad.
> >
> >
> > Ran
> >
> > #: setpci -s 3:00.0 82.w=8:8
> >
> > SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
> >         Changed: MRL- PresDet- LinkState-
> >
> > #: setpci -s 3:00.0 78.w=20:20
> >
> > SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
> >         Changed: MRL- PresDet- LinkState-
> >
> >
> > When I run lspci -vvs 3:00.0 it is currently in this state
> >
> > SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
> >         Changed: MRL- PresDet- LinkState-
> >
> > I didn't realize that I was needing to look at "PresDet", sorry. It does
> > look like it is different than before so I assume the setpci commands
> > changed it somewhere.
> >
> > The device (GPU) is still in the "(rev ff) (prog-if ff)" state.
>
> Ok, it would have been a long shot, the Presence Detect Changed bit
> really should not have been having any effect on re-establishing the
> link, it was just a notable difference between the working and
> non-working examples.
>
> > Do you think this could be a GPU issue? I have not tried a different GPU
> > in the system. Would it be worthwhile trying an NVidia M4000 to see if I
> > get the same results or do you think there is a problem with the PLX Riser?
>
> I can only speculate here, but I wouldn't expect PCIe link
> characteristics to be significantly different between consumer and
> workstation class cards.  If you have one on hand, it certainly doesn't
> hurt to try though.  Perhaps performing the same test with a non-NVIDIA
> card installed might be more enlightening, preferably a card with
> similar PCIe width and speed, but any sort of data point might be
> useful.
>
> I will note that NVIDIA does make use of PLX PCIe switches on some of
> their devices, both the GRID K1 and Tesla M60 (probably others as well)
> make use of a PLX PEX 8747 switch to pack multiple GPUs onto a single
> card.  So there might be a reasonable expectation of PLX switches
> working with NVIDIA devices.  What sort of tuning or special
> configuration NVIDIA does on those since the switch is onboard the
> card, I have no idea.  Thanks,
>
> Alex
>
___
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users


Re: [vfio-users] Bus reset trouble with Titan-X

2016-10-19 Thread Alex Williamson
On Wed, 19 Oct 2016 12:16:30 -0500
Kevin Vasko  wrote:

> Ah, ok. My bad.
> 
> 
> Ran
> 
> #: setpci -s 3:00.0 82.w=8:8
>
> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>         Changed: MRL- PresDet- LinkState-
>
> #: setpci -s 3:00.0 78.w=20:20
>
> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>         Changed: MRL- PresDet- LinkState-
>
>
> When I run lspci -vvs 3:00.0 it is currently in this state
>
> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>         Changed: MRL- PresDet- LinkState-
> 
> I didn't realize that I was needing to look at "PresDet", sorry. It does
> look like it is different than before so I assume the setpci commands
> changed it somewhere.
> 
> The device (GPU) is still in the "(rev ff) (prog-if ff)" state.

Ok, it was a long shot; the Presence Detect Changed bit really should
not have any effect on re-establishing the link.  It was just a notable
difference between the working and non-working examples.
 
> Do you think this could be a GPU issue? I have not tried a different GPU in
> the system. Would it be worthwhile trying an NVidia M4000 to see if I get
> the same results or do you think there is a problem with the PLX Riser?

I can only speculate here, but I wouldn't expect PCIe link
characteristics to be significantly different between consumer and
workstation class cards.  If you have one on hand, it certainly doesn't
hurt to try though.  Perhaps performing the same test with a non-NVIDIA
card installed might be more enlightening, preferably a card with
similar PCIe width and speed, but any sort of data point might be
useful.

I will note that NVIDIA does make use of PLX PCIe switches on some of
their devices, both the GRID K1 and Tesla M60 (probably others as well)
make use of a PLX PEX 8747 switch to pack multiple GPUs onto a single
card.  So there might be a reasonable expectation of PLX switches
working with NVIDIA devices.  What sort of tuning or special
configuration NVIDIA does on those since the switch is onboard the
card, I have no idea.  Thanks,

Alex



Re: [vfio-users] Bus reset trouble with Titan-X

2016-10-19 Thread Kevin Vasko
Ah, ok. My bad.


Ran

#: setpci -s 3:00.0 82.w=8:8

SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
        Changed: MRL- PresDet- LinkState-

#: setpci -s 3:00.0 78.w=20:20

SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
        Changed: MRL- PresDet- LinkState-


When I run lspci -vvs 3:00.0 it is currently in this state

SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
        Changed: MRL- PresDet- LinkState-

I didn't realize that I needed to look at "PresDet", sorry. It does
look different than before, so I assume the setpci commands changed it.

The device (GPU) is still in the "(rev ff) (prog-if ff)" state.

Do you think this could be a GPU issue? I have not tried a different GPU in
the system. Would it be worthwhile trying an NVidia M4000 to see if I get
the same results or do you think there is a problem with the PLX Riser?

Thanks,

-Kevin






On Wed, Oct 19, 2016 at 11:57 AM, Alex Williamson <
alex.william...@redhat.com> wrote:

> On Wed, 19 Oct 2016 11:46:21 -0500
> Kevin Vasko  wrote:
>
> > Alex,
> >
> > Thanks, but no luck.
> >
> > I ran :
> >
> > #:setpci -s 3:00.0 82.w=8:8
> >
> > checked
> >
> > #:lspci -vvvs 3:00.0
> >
> > MRL- was the same.
>
> PresDet+ on the Changed: line was the thing we were looking for.  MRL
> is a retention latch specifically for hotplug capable slots.
>
> > #: setpci -s 3:00.0 78.w=20:20
> >
> > checked:
> >
> > #: lspci -vvs 3:00.0
> >
> > MRL- was the same
> >
> >
> > LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt-
> > ABWMgmt-
> >
> > SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
> >Changed: MRL- PresDet+ LinkState-
> >
> > Just for my own knowledge what does "retrain" mean? I assume resetting
> > the bus and it reconnecting successfully?
>
> Retraining triggers a re-sync of the link width and speed parameters,
> think of it like kicking an Ethernet connection to renegotiate
> 10/100/1000Mbps speeds, a similar thing happens between a device and
> the downstream port that it's connected to to determine the link
> parameters.  Thanks,
>
> Alex
>


Re: [vfio-users] Bus reset trouble with Titan-X

2016-10-19 Thread Alex Williamson
On Wed, 19 Oct 2016 11:46:21 -0500
Kevin Vasko  wrote:

> Alex,
> 
> Thanks, but no luck.
> 
> I ran :
> 
> #:setpci -s 3:00.0 82.w=8:8
> 
> checked
> 
> #:lspci -vvvs 3:00.0
> 
> MRL- was the same.

PresDet+ on the Changed: line was the thing we were looking for.  MRL
is a retention latch specifically for hotplug capable slots.
 
> #: setpci -s 3:00.0 78.w=20:20
> 
> checked:
> 
> #: lspci -vvs 3:00.0
> 
> MRL- was the same
> 
> 
> LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt-
> ABWMgmt-
> 
> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>Changed: MRL- PresDet+ LinkState-
> 
> Just for my own knowledge what does "retrain" mean? I assume resetting the
> bus and it reconnecting successfully?

Retraining triggers a re-sync of the link width and speed parameters.
Think of it like kicking an Ethernet connection to renegotiate
10/100/1000Mbps speeds; a similar negotiation happens between a device
and the downstream port that it's connected to in order to determine the
link parameters.  Thanks,

Alex
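
For anyone following along: the negotiated speed and width described above can be read from the LnkSta line of the lspci -vv output. The snippet below is only a sketch. The helper names link_state and link_is_down are made up for illustration, and the pattern assumes the LnkSta format quoted in this thread (other pciutils versions may add extra annotations to that line).

```shell
# Hypothetical helpers; both parse lspci -vv text supplied on stdin.

# Print "<speed> <width>" from the LnkSta line, e.g. "2.5GT/s x0".
link_state() {
    sed -n 's/.*LnkSta:[[:space:]]*Speed \([^,]*\), Width \(x[0-9]*\).*/\1 \2/p'
}

# Succeed if the negotiated width is x0, i.e. the link did not train.
link_is_down() {
    case "$(link_state)" in
        *" x0") return 0 ;;
        *)      return 1 ;;
    esac
}
```

Run as root, something like "lspci -vvs 3:00.0 | link_state" would be expected to print "2.5GT/s x0" for the failed port in this thread and "8GT/s x16" for a healthy one.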



Re: [vfio-users] Bus reset trouble with Titan-X

2016-10-19 Thread Kevin Vasko
Alex,

Thanks, but no luck.

I ran :

#:setpci -s 3:00.0 82.w=8:8

checked

#:lspci -vvvs 3:00.0

MRL- was the same.

#: setpci -s 3:00.0 78.w=20:20

checked:

#: lspci -vvs 3:00.0

MRL- was the same


LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt-
ABWMgmt-

SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
   Changed: MRL- PresDet+ LinkState-

Just for my own knowledge what does "retrain" mean? I assume resetting the
bus and it reconnecting successfully?

Thanks again,

-Kevin

On Wed, Oct 19, 2016 at 10:50 AM, Alex Williamson <
alex.william...@redhat.com> wrote:

> On Wed, 19 Oct 2016 10:00:57 -0500
> Kevin Vasko  wrote:
>
> > Sure thing. I'm attaching all of the logs I have to let you get a bigger
> > picture (and anyone that might run into a similar issue). Hopefully I
> > didn't mess anything up.
> >
> ...
>
> Here's the bit I was curious about:
>
> > #showing parent bridge of a device that has failed
> > #:lspci -vvvs 03:00
> > 03:00.0 PCI bridge: PLX Technology, Inc. Device 8796 (rev ab) (prog-if 00 [Normal decode])
> ...
> > LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <4us, L1 <8us
> >         ClockPM- Surprise- LLActRep- BwNot-
> > LnkCtl: ASPM Disabled; Disabled- CommClk-
> >         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
>
>
> The Link Status shows that it's in Gen1 mode at x0 width, so the link
> failed to return to a working state after bus reset.  Maybe a hint is
> that the Slot Status register shows that the Presence Detect Changed bit
> got flipped, but the Presence Detect State bit remains 1, indicating
> that a card is present.  However Presence Detect Changed Enable is not
> set in the Slot Control register, so the OS doesn't get notified about
> this.
>
> I wonder what would happen if we cleared the Presence Detect Changed
> bit and tried to retrain the link.  The express capability is at 0x68,
> the slot status register is at 0x1a, bit 3 is the presence detect
> changed bit and it's RW1C (read, write 1 to clear).  Therefore to clear
> the bit we could do:
>
> setpci -s 3:00.0 82.w=8:8
>
> Recheck with lspci -vvvs 3:00.0 to see whether PresDet on the Changed:
> line of
>
> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>         Changed: MRL- PresDet+ LinkState-
>
> still reports + or -, and possibly whether the link has decided to retrain.
> To force a retrain we need to poke bit 5 in the link control register,
> offset 0x10:
>
> setpci -s 3:00.0 78.w=20:20
>
> Recheck lspci to see if there's any progress.
>
> ...
> > #showing parent bridge of a device that has NOT failed
> > #: lspci -vvvs 03:08
> > 03:08.0 PCI bridge: PLX Technology, Inc. Device 8796 (rev ab) (prog-if 00 [Normal decode])
> ...
> > LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <4us, L1 <8us
> >         ClockPM- Surprise- LLActRep- BwNot-
> > LnkCtl: ASPM Disabled; Disabled- CommClk-
> >         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
>
> In this case the link has retrained to Gen3 x16 and of course the
> downstream devices are accessible.  The Presence Detect Changed bit is
> set to - on this port.  Thanks,
>
> Alex
>


Re: [vfio-users] Bus reset trouble with Titan-X

2016-10-19 Thread Alex Williamson
On Wed, 19 Oct 2016 10:00:57 -0500
Kevin Vasko  wrote:

> Sure thing. I'm attaching all of the logs I have to let you get a bigger
> picture (and anyone that might run into a similar issue). Hopefully I
> didn't mess anything up.
> 
...

Here's the bit I was curious about:

> #showing parent bridge of a device that has failed
> #:lspci -vvvs 03:00
> 03:00.0 PCI bridge: PLX Technology, Inc. Device 8796 (rev ab) (prog-if 00 [Normal decode])
...
> LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <4us, L1 <8us
>         ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; Disabled- CommClk-
>         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-


The Link Status shows that it's in Gen1 mode at x0 width, so the link
failed to return to a working state after bus reset.  Maybe a hint is
that the Slot Status register shows that the Presence Detect Changed bit
got flipped, but the Presence Detect State bit remains 1, indicating
that a card is present.  However Presence Detect Changed Enable is not
set in the Slot Control register, so the OS doesn't get notified about
this.

I wonder what would happen if we cleared the Presence Detect Changed
bit and tried to retrain the link.  The express capability is at 0x68,
the slot status register is at 0x1a, bit 3 is the presence detect
changed bit and it's RW1C (read, write 1 to clear).  Therefore to clear
the bit we could do:

setpci -s 3:00.0 82.w=8:8

Recheck with lspci -vvvs 3:00.0 to see whether PresDet on the Changed:
line of

SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
        Changed: MRL- PresDet+ LinkState-

still reports + or -, and possibly whether the link has decided to retrain.
To force a retrain we need to poke bit 5 in the link control register,
offset 0x10:

setpci -s 3:00.0 78.w=20:20

Recheck lspci to see if there's any progress.
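
The two setpci addresses above follow from simple arithmetic: capability base plus register offset gives the config-space address, and the bit number gives the value and mask. A small sketch of that calculation follows; pcie_set_bit_cmd is a hypothetical helper name, the 0x68 base is the value read from this particular bridge, and the function only prints the command rather than running it.

```shell
# Hypothetical helper: build the setpci command that sets one bit in a
# 16-bit register of the PCI Express capability (prints it, does not run it).
#   $1 = device, $2 = capability base, $3 = register offset, $4 = bit number
pcie_set_bit_cmd() {
    reg=$(( $2 + $3 ))
    bit=$(( 1 << $4 ))
    printf 'setpci -s %s %x.w=%x:%x\n' "$1" "$reg" "$bit" "$bit"
}

# Slot Status (offset 0x1a), bit 3: Presence Detect Changed (RW1C)
pcie_set_bit_cmd 3:00.0 0x68 0x1a 3   # prints: setpci -s 3:00.0 82.w=8:8
# Link Control (offset 0x10), bit 5: Retrain Link
pcie_set_bit_cmd 3:00.0 0x68 0x10 5   # prints: setpci -s 3:00.0 78.w=20:20
```

On a different bridge the capability base may differ, so it should be taken from that device's own lspci -vvv output ("Capabilities: [68] Express ..." on this one).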

... 
> #showing parent bridge of a device that has NOT failed
> #: lspci -vvvs 03:08
> 03:08.0 PCI bridge: PLX Technology, Inc. Device 8796 (rev ab) (prog-if 00 [Normal decode])
...
> LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <4us, L1 <8us
>         ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; Disabled- CommClk-
>         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-

In this case the link has retrained to Gen3 x16 and of course the
downstream devices are accessible.  The Presence Detect Changed bit is
set to - on this port.  Thanks,

Alex



Re: [vfio-users] Bus reset trouble with Titan-X

2016-10-19 Thread Kevin Vasko
Sure thing. I'm attaching all of the logs I have to let you get a bigger
picture (and anyone that might run into a similar issue). Hopefully I
didn't mess anything up.

Unfortunately, I've seen almost every single device fail at one point or
another. I was thinking it might be isolated to a single PLX Riser card but
I have now seen devices fail on every single parent device at one time or
another. Based on that, I don't think I could narrow it down to a single
PCISlot/PLX Riser that is the culprit. Unless both of these boards are bad,
my conclusion is that this indicates a problem with the hardware as well. I
completely agree that if the PCI Bus reset isn't working properly, nothing
is going to work.

I sent these steps to the manufacturer to see if they could reproduce the
issue on their end. If they can then they will need to investigate on their
end why the problem exists. If they can't, it is possible we have a bad set
of boards in this machine.

Thank you so much for your help. Really appreciate it.

-Kevin

#: lspci -tv

 |           +-1f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
 |           \-1f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
 \-[0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
             +-01.0-[01]--
             +-02.0-[02-08]----00.0-[03-08]--+-00.0-[04]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               +-04.0-[05]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               +-08.0-[06]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               +-0c.0-[07]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               \-14.0-[08]----00.0  Mellanox Technologies MT27500 Family [ConnectX-3]
             +-03.0-[09-12]----00.0-[0a-12]--+-08.0-[0b-11]----00.0-[0c-11]--+-00.0-[0d]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               |                               +-04.0-[0e]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               |                               +-08.0-[0f]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               |                               +-0c.0-[10]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               |                               \-14.0-[11]--+-00.0  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            +-00.1  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            +-00.2  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            +-00.3  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            +-00.4  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            +-00.5  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            +-00.6  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            \-00.7  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               \-10.0-[12]--
             +-05.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Address Map, VTd_Misc, System Management
             +-05.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Hot Plug
             +-05.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 RAS, Control Status and Global Errors


# showing which ones are in failed state
:# lspci -vnnn | grep NVIDIA

04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
04:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev 

Re: [vfio-users] Bus reset trouble with Titan-X

2016-10-18 Thread Alex Williamson
On Tue, 18 Oct 2016 17:48:59 -0500
Kevin Vasko  wrote:

> Alex,
> 
> I think I was able to run it successfully and was able to make the thing
> fail. It went from (rev a1) to (rev ff), along with the header error.
> 
> Instead of doing all devices I just did 1 at a time.
> 
> this was the output of
> 
> # lspci -tv
> 
> +-02.0-[02-08]----00.0-[03-08]--+-00.0-[04]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> |                               +-04.0-[05]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> |                               +-08.0-[06]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> |                               +-0c.0-[07]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> |                               \-14.0-[08]----00.0  Mellanox Technologies MT27500 Family [ConnectX-3]
> +-03.0-[09-12]----00.0-[0a-12]--+-08.0-[0b-11]----00.0-[0c-11]--+-00.0-[0d]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> |                               |                               +-04.0-[0e]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> |                               |                               +-08.0-[0f]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> |                               |                               +-0c.0-[10]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
>
> 
> I tried the first device
> # virsh nodedev-detach --driver=kvm pci_0000_04_00_0
> Device pci_0000_04_00_0 detached
>
> # virsh nodedev-detach --driver=kvm pci_0000_04_00_1
> Device pci_0000_04_00_1 detached
> 
> In the script I put
> 
> DEVS=(
> 03:00.0
> 04
> )
> 
> Ran it 100 times and got no error.
> 
> Ran it for a different device 05
> 
> 
> 
> # virsh nodedev-detach --driver=kvm pci_0000_05_00_0
> Device pci_0000_05_00_0 detached
>
> # virsh nodedev-detach --driver=kvm pci_0000_05_00_1
> Device pci_0000_05_00_1 detached
> 
> DEVS=(
> 03:04.0
> 05:
> )
> 
> 
> I saw this.
> 
> #: for i in $(seq 1 100); do ./reset.sh; done
> 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
> TITAN X] (rev a1)
> 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
> 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
> TITAN X] (rev a1)
> 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
> 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
> TITAN X] (rev ff)
> 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev ff)
> 
> I repeated this with another device on the system.
> 
> I assume this indicates that the device is not resetting properly? The
> question is where do I go from here? Would this indicate a problem with the
> PCI Reset code or problematic hardware?

Right, the PCIe link is not coming back for some reason, which seems
like a hardware issue.  Can you attach the output of 'sudo lspci -vvvs
3:04.0' when you're in this state (replace with the appropriate parent
bridge depending on the failed device)?  Maybe we can see if that
downstream port is stuck in training.

What I would do next is to test each card repeatedly.  Do only some
cards fail?  If so, swap a working card and a non-working card, does
the failure follow the card or the slot?  I'm not sure what the result
is going to be, but if we can't rely on a PCI bus reset then you're
really not going to have any repeatability with assigning the GPUs.
Thanks,

Alex



Re: [vfio-users] Bus reset trouble with Titan-X

2016-10-18 Thread Kevin Vasko
Alex,

I think I was able to run it successfully and was able to make the thing
fail. It went from (rev a1) to (rev ff), along with the header error.

Instead of doing all devices I just did 1 at a time.

this was the output of

# lspci -tv

+-02.0-[02-08]----00.0-[03-08]--+-00.0-[04]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               +-04.0-[05]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               +-08.0-[06]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               +-0c.0-[07]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               \-14.0-[08]----00.0  Mellanox Technologies MT27500 Family [ConnectX-3]
+-03.0-[09-12]----00.0-[0a-12]--+-08.0-[0b-11]----00.0-[0c-11]--+-00.0-[0d]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               |                               +-04.0-[0e]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               |                               +-08.0-[0f]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
|                               |                               +-0c.0-[10]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
|                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0

I tried the first device
# virsh nodedev-detach --driver=kvm pci_0000_04_00_0
Device pci_0000_04_00_0 detached

# virsh nodedev-detach --driver=kvm pci_0000_04_00_1
Device pci_0000_04_00_1 detached

In the script I put

DEVS=(
03:00.0
04
)

Ran it 100 times and got no error.

Ran it for a different device 05



# virsh nodedev-detach --driver=kvm pci_0000_05_00_0
Device pci_0000_05_00_0 detached

# virsh nodedev-detach --driver=kvm pci_0000_05_00_1
Device pci_0000_05_00_1 detached

DEVS=(
03:04.0
05:
)


I saw this.

#: for i in $(seq 1 100); do ./reset.sh; done
05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
TITAN X] (rev a1)
05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
TITAN X] (rev a1)
05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
TITAN X] (rev ff)
05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev ff)

I repeated this with another device on the system.

I assume this indicates that the device is not resetting properly? The
question is where do I go from here? Would this indicate a problem with the
PCI Reset code or problematic hardware?


-Kevin





On Tue, Oct 18, 2016 at 11:49 AM, Alex Williamson <
alex.william...@redhat.com> wrote:

> On Tue, 18 Oct 2016 11:04:14 -0500
> Kevin Vasko  wrote:
>
> > Alex,
> >
> > (crossing fingers this goes into the correct thread).
> >
> > I upgraded this machine to 4.4.0-42-generic.
> >
> > I spawned a single VM with 1 GPU immediately after the kernel upgrade. It
> > works. It attached properly and in the VM when I ran lspci, it showed up
> > properly.
> >
> > I deleted that VM and started up the system with 4x GPUs, and then it
> > started exhibiting the same issue. Three of the GPUs attached properly.
> >
> > It appears that the problem was not resolved by upgrading the kernel. If
> > you don't mind providing instructions on resetting the bus to see if I
> > can narrow this down further (what you were talking about yesterday),
> > that would be appreciated. Any other suggestions would be greatly
> > appreciated as well.
>
> Ok, you're going to need to identify the parent bridge for the GPUs.
> You can do this with 'lspci -tv'.  If you need help, send the output of
> that command.  Here's an example:
>
> # lspci -tv
> -[0000:00]-+-00.0  Intel Corporation 5520/5500/X58 I/O Hub to ESI Port
>            +-01.0-[01]--+-00.0  Intel Corporation 82576 Gigabit Network Connection
>            |            \-00.1  Intel Corporation 82576 Gigabit Network Connection
>            +-03.0-[02]----00.0  Fresco Logic FL1100 USB 3.0 Host Controller
>            +-07.0-[03]--+-00.0  Intel Corporation Ethernet Controller X710 for 10GbE SFP+
>            |            \-00.1  Intel Corporation Ethernet Controller X710 for 10GbE SFP+
>            ...
>
> Say I want to do a bus reset on the X710 ethernet devices at 03:00.0
> and 03:00.1.  This should be similar to a GPU and companion audio

Re: [vfio-users] Bus reset trouble with Titan-X

2016-10-18 Thread Alex Williamson
On Tue, 18 Oct 2016 11:04:14 -0500
Kevin Vasko  wrote:

> Alex,
> 
> (crossing fingers this goes into the correct thread).
> 
> I upgraded this machine to 4.4.0-42-generic.
> 
> I spawned a single VM with 1 GPU immediately after the kernel upgrade. It
> works. It attached properly and in the VM when I ran lspci, it showed up
> properly.
> 
> I deleted that VM and started up the system with 4x GPUs, and then it
> started exhibiting the same issue. Three of the GPUs attached properly.
> 
> It appears that the problem was not resolved by upgrading the kernel. If
> you don't mind providing instructions on resetting the bus to see if I can
> narrow this down further (what you were talking about yesterday) that would
> be appreciated. Any other suggestions would be greatly appreciated as well.

Ok, you're going to need to identify the parent bridge for the GPUs.
You can do this with 'lspci -tv'.  If you need help, send the output of
that command.  Here's an example:

# lspci -tv
-[0000:00]-+-00.0  Intel Corporation 5520/5500/X58 I/O Hub to ESI Port
           +-01.0-[01]--+-00.0  Intel Corporation 82576 Gigabit Network Connection
           |            \-00.1  Intel Corporation 82576 Gigabit Network Connection
           +-03.0-[02]----00.0  Fresco Logic FL1100 USB 3.0 Host Controller
           +-07.0-[03]--+-00.0  Intel Corporation Ethernet Controller X710 for 10GbE SFP+
           |            \-00.1  Intel Corporation Ethernet Controller X710 for 10GbE SFP+
           ...

Say I want to do a bus reset on the X710 ethernet devices at 03:00.0
and 03:00.1.  This should be similar to a GPU and companion audio
device.  The parent bridge is device 00:07.0.  I can double check this
by running lspci on this device:

# lspci -vs 00:07.0
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 27
        Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
 

The secondary bus is 03, thus it's the parent device of 03:00.[01].
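
The same parent/child relationship can also be read from sysfs, where each PCI device directory sits inside the directory of its parent bridge. This is a minimal sketch, assuming the usual Linux sysfs layout; parent_bridge is a hypothetical helper name, and the optional second argument exists only so the function can be pointed at a different directory for testing.

```shell
# Hypothetical helper: print the parent bridge of a PCI device.
#   $1 = full device address, e.g. 0000:03:00.0
#   $2 = directory holding the per-device symlinks
#        (assumed default: /sys/bus/pci/devices)
parent_bridge() {
    dir=${2:-/sys/bus/pci/devices}
    [ -e "$dir/$1" ] || return 1
    # Resolve the symlink to the real device path, then step up one level.
    real=$(readlink -f "$dir/$1") || return 1
    basename "$(dirname "$real")"
}
```

On the example system above, "parent_bridge 0000:03:00.0" should print 0000:00:07.0. For a device that sits directly on the root bus, the result is the host bridge directory name (something like pci0000:00) rather than a device address.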

Prior to performing a bus reset, attach all the affected devices to a
driver that isn't going to be making use of the devices, for instance
pci-stub.  We can do this with virsh using:

# virsh nodedev-detach --driver=kvm pci_0000_03_00_0
Device pci_0000_03_00_0 detached

# virsh nodedev-detach --driver=kvm pci_0000_03_00_1
Device pci_0000_03_00_1 detached

The "--driver=kvm" simply selects pci-stub rather than vfio-pci, which
would otherwise be the default.

Also note that after a bus reset, the downstream devices are not going
to be usable until after a system reboot.  Our goal is to see how
reliably we can perform a bus reset and have the devices re-appear, we
cannot make use of them beyond running lspci on them without a system
reboot.

Ok, so for each GPU you should know the parent bridge, the address of
the GPUs themselves, and each GPU and companion audio device should be
bound to pci-stub.

Using the dual port NICs as stand-ins for your GPUs, we need a script
like this:

# cat reset.sh 
#!/bin/bash
# bash rather than plain sh: the script relies on arrays

DEVS=(
    00:01.0 # parent of 01:
    01:     # affected devices of 01.0
    00:07.0 # parent of 03:
    03:     # affected devices of 07.0
    # change the entries above for your system
    # you will have more devices here, a parent bridge
    # followed by the bus of the affected GPU, 0f:, 10:, 0e:, 0d:
)

i=0

while [ $i -lt ${#DEVS[@]} ]; do
    setpci -s "${DEVS[$i]}" 3e.w=40:40 # Set 2ndary bus reset bit
    sleep 0.2
    setpci -s "${DEVS[$i]}" 3e.w=0:40 # Clear 2ndary bus reset bit
    sleep 1
    # when this reports abnormally, we've failed
    lspci -s "${DEVS[$(( i + 1 ))]}"
    i=$(( i + 2 ))
done
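
A note on the value:mask form used in the script: setpci changes only the bits selected by the mask and leaves the rest of the register as it was. Offset 0x3e is the Bridge Control register, and bit 6 (0x40) is Secondary Bus Reset. The sketch below only models that read-modify-write arithmetic; masked_write is a hypothetical helper name and the starting value 0x0003 is an arbitrary example, not a value read from this system.

```shell
# Hypothetical helper: compute a 16-bit register value after a
# setpci-style "value:mask" write: new = (old & ~mask) | (value & mask)
#   $1 = current register value, $2 = value, $3 = mask
masked_write() {
    printf '%04x\n' $(( ($1 & ~$3 & 0xffff) | ($2 & $3) ))
}

masked_write 0x0003 0x40 0x40   # prints 0043: bit 6 set, other bits kept
masked_write 0x0043 0x0 0x40    # prints 0003: bit 6 cleared again
```

This is why the script can use 3e.w=40:40 and 3e.w=0:40 without first reading the register by hand.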

Don't forget to chmod 755 the script.  Run it once and it should
produce something like this (of course with your GPUs instead of my
NICs):

# ./reset.sh 
01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
03:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)

If that works, then run it 100 times:

# for i in $(seq 1 100); do ./reset.sh; done

If you start seeing "(rev ff) (prog-if ff)" then the device has
failed.  (left as an exercise to the reader to automatically stop on
this condition ;)  Please report what you find and remember that it's
expected that you will need to reboot the system after performing this
test to get the devices back into a workable state.  We're not saving
and restoring the state of the devices around reset.  Thanks,

Alex
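
One possible take on the "stop automatically" exercise mentioned above, offered as a sketch rather than a tested tool: it assumes a failed device shows up as "(rev ff)" in the output of the reset script, as described in this message. The names has_failed_device and run_until_failure are made up for illustration.

```shell
# Hypothetical helpers for stopping at the first failed reset.

# Succeed if the lspci text on stdin contains a "(rev ff)" entry.
has_failed_device() {
    grep -q '(rev ff)'
}

# Run a command up to N times and stop at the first failure.
#   $1 = maximum number of iterations, remaining args = command to run
run_until_failure() {
    max=$1
    shift
    n=1
    while [ "$n" -le "$max" ]; do
        out=$("$@")
        printf '%s\n' "$out"
        if printf '%s\n' "$out" | has_failed_device; then
            echo "device failed on iteration $n" >&2
            return 1
        fi
        n=$(( n + 1 ))
    done
    return 0
}
```

With these defined in the same shell, "run_until_failure 100 ./reset.sh" would replace the plain for loop and leave the system in the failed state for inspection with lspci -vvv.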



Re: [vfio-users] Bus reset trouble with Titan-X (was Re: Welcome to the "vfio-users" mailing list (Digest mode))

2016-10-17 Thread Kevin Vasko
Thanks. I'm an idiot. I just replied to the email directly after the
subscription and wasn't paying attention. Thank you for correcting it.

I was originally running 3.13.0-86-generic and upgraded to the 3.19 version
before I posted this, but got the same results. I'll try a newer
version of the kernel and see what happens.

Sorry to be dense but what do you mean by "retrain properly"? I assume you
mean that once it fails to reset it just never recovers?

We have 2 other machines that I've never seen this problem with, so what
you are saying makes sense. This system does have a slightly more
specialized PCI bus to be able to fit 8 cards on a single bus (at least
that is my understanding), so at this point either I'm hitting a bug that
is fixed in a newer kernel, or this PCI bus is not doing something that
vfio-pci is expecting (that would be my speculation).

I'll report back my findings tomorrow.

Thanks for the help.

-Kevin






On Mon, Oct 17, 2016 at 5:53 PM, Alex Williamson  wrote:

> (generally a good idea to have a useful subject line)
>
> On Mon, 17 Oct 2016 16:26:15 -0500
> Kevin Vasko  wrote:
> >
> > Any suggestions on debugging a !!! Unknown header type 7f?
> >
>
> This usually means that the device didn't come back from bus reset and
> re-reading the PCI config space where the device was just gives a -1
> response.  lspci tries to interpret that bogus data and gives results
> like you see.  You might try a newer kernel, we've probably fixed some
> things in the bus reset path since v3.19.  It looks like you continue
> to see the bogus data once it gets into this state, so it's probably
> not a "simple" device coming out of reset too slowly problem.  Possibly
> the PCIe link doesn't retrain properly sometimes after a bus reset.  If
> a new kernel doesn't help, I could give you instructions for performing
> a bus reset with setpci and you could test how reliably you can reset
> the device and read config space after.  Thanks,
>
> Alex
>
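
The "-1 response" described above also accounts for the exact strings seen in this thread. When nothing answers a config-space read, every byte comes back as 0xff, so the revision ID (offset 0x08) and prog-if (offset 0x09) print as ff. As far as I know, lspci masks off bit 7 (the multi-function flag) of the header type byte at offset 0x0e before interpreting it, which is where 7f comes from. A tiny sketch of that masking, with header_type as a hypothetical helper name:

```shell
# Hypothetical helper: the header type as lspci would interpret it,
# i.e. the raw byte with the multi-function bit (bit 7) removed.
#   $1 = raw header type byte read from config space
header_type() {
    printf '%02x\n' $(( $1 & 0x7f ))
}

header_type 0xff   # prints 7f, matching "Unknown header type 7f"
header_type 0x81   # prints 01: a multi-function PCI-to-PCI bridge
```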