On Tue, Oct 18, 2016 at 11:04 AM, Kevin Vasko <kva...@gmail.com> wrote:
> Alex,
>
> (crossing fingers this goes into the correct thread).
>
> I upgraded this machine to 4.4.0-42-generic.
>
> I spawned a single VM with 1 GPU immediately after the kernel upgrade. It
> works. It attached properly and in the VM when I ran lspci, it showed up
> properly.
>
> I deleted that VM and started up the system with 4x GPUs, and then it
> started exhibiting the same issue. Three of the GPUs attached properly.
>
> This appears to be that it was not resolved with upgrading the kernel. If
> you don't mind providing instructions on resetting the bus to see if I can
> narrow this down further (what you were talking about yesterday) that would
> be appreciated. Any other suggestions would be greatly appreciated as well.
>
> Here are the logs of the 4 GPU attachment that failed.
>
> On the host.
>
> /etc/var/log/libvirt/qemu/instance-00000185.log
>
> this shows the /usr/bin/kvm command issuing the connection of the
> following devices
>
> -device vfio-pci,host=0f:00.0,id=hostdev0,bus=pci.0,addr=0x5
> -device vfio-pci,host=10:00.0,id=hostdev1,bus=pci.0,addr=0x6
> -device vfio-pci,host=0e:00.0,id=hostdev2,bus=pci.0,addr=0x7
> -device vfio-pci,host=0d:00.0,id=hostdev3,bus=pci.0,addr=0x8
>
> lspci -vnnn -d 10de:17c2 (on the host, I omitted the other 4 GPUs)
>
> 0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
>     subsystem: NVIDIA Corporation Device [10de:1132]
>     Flags: fast devsel, IRQ 28
>     Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
>     Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
>     Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
>     I/O ports at 3000 [size=128]
>     Expansion ROM at ba000000 [disabled] [size=512k]
>     Capabilities: [60] Power Management version 3
>     Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
>     Capabilities: [78] Express Legacy Endpoint, MSI 00
>     Capabilities: [100] Express Legacy Endpoint, MSI 00
>     Capabilities: [250] Latency Tolerance Reporting
>     Capabilities: [258] L1 PM Substates
>     Capabilities: [128] Power Budgeting <?>
>     Capabilities: [420] Advanced Error Reporting
>     Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>     Capabilities: [900] #19
>     Kernel driver in use: vfio-pci
>
> 0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
>     subsystem: NVIDIA Corporation Device [10de:1132]
>     Flags: fast devsel, IRQ 28
>     Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
>     Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
>     Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
>     I/O ports at 3000 [size=128]
>     Expansion ROM at ba000000 [disabled] [size=512k]
>     Capabilities: [60] Power Management version 3
>     Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
>     Capabilities: [78] Express Legacy Endpoint, MSI 00
>     Capabilities: [100] Express Legacy Endpoint, MSI 00
>     Capabilities: [250] Latency Tolerance Reporting
>     Capabilities: [258] L1 PM Substates
>     Capabilities: [128] Power Budgeting <?>
>     Capabilities: [420] Advanced Error Reporting
>     Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>     Capabilities: [900] #19
>     Kernel driver in use: vfio-pci
>
> 0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
>     !!! Unknown header type 7f
>     Kernel driver in use: vfio-pci
>
> 10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
>     subsystem: NVIDIA Corporation Device [10de:1132]
>     Flags: fast devsel, IRQ 28
>     Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
>     Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
>     Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
>     I/O ports at 3000 [size=128]
>     Expansion ROM at ba000000 [disabled] [size=512k]
>     Capabilities: [60] Power Management version 3
>     Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
>     Capabilities: [78] Express Legacy Endpoint, MSI 00
>     Capabilities: [100] Express Legacy Endpoint, MSI 00
>     Capabilities: [250] Latency Tolerance Reporting
>     Capabilities: [258] L1 PM Substates
>     Capabilities: [128] Power Budgeting <?>
>     Capabilities: [420] Advanced Error Reporting
>     Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>     Capabilities: [900] #19
>     Kernel driver in use: vfio-pci
>
> On the VM guest:
>
> lspci
>
> 00:06.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
> 00:07.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
> 00:08.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
>
> dmesg
>
> [ 0.787786] pci 0000:00:05.0: [10de:17c2] type 7f class 0xffffff
> [ 0.788970] pci 0000:00:06.0: [10de:17c2] type 00 class 0x030000
> [ 0.855192] pci 0000:00:07.0: [10de:17c2] type 00 class 0x030000
> [ 0.925003] pci 0000:00:08.0: [10de:17c2] type 00 class 0x030000
>
> On Mon, Oct 17, 2016 at 11:10 PM, Kevin Vasko <kva...@gmail.com> wrote:
>
>> Thanks. I'm an idiot. I just replied to the email directly after the
>> subscription and wasn't paying attention. Thank you for correcting it.
>>
>> I was originally running 3.13.0-86-generic and upgraded to the 3.19 version
>> to try before I posted this, but got the same results. I'll try a newer
>> version of the kernel and see what happens.
>>
>> Sorry to be dense but what do you mean by "retrain properly"? I assume
>> you mean that once it fails to reset it just never recovers?
>>
>> We have 2 other machines that I've never seen this problem with so
>> what you are saying makes sense. This system does have a slightly more
>> specialized PCI bus to be able to stick 8 cards on a single bus (at least
>> that is my understanding), so at this point, either I'm hitting a bug that
>> is fixed in the kernel, or this PCI bus is not doing something that
>> vfio-pci is expecting (would be my speculation).
>>
>> I'll report back my findings tomorrow.
>>
>> Thanks for the help.
>>
>> -Kevin
>>
>> On Mon, Oct 17, 2016 at 5:53 PM, Alex Williamson <alex.william...@redhat.com> wrote:
>>
>>> (generally a good idea to have a useful subject line)
>>>
>>> On Mon, 17 Oct 2016 16:26:15 -0500
>>> Kevin Vasko <kva...@gmail.com> wrote:
>>> >
>>> > Any suggestions on debugging a !!! Unknown header type 7f?
>>> >
>>>
>>> This usually means that the device didn't come back from bus reset and
>>> re-reading the PCI config space where the device was just gives a -1
>>> response. lspci tries to interpret that bogus data and gives results
>>> like you see. You might try a newer kernel, we've probably fixed some
>>> things in the bus reset path since v3.19. It looks like you continue
>>> to see the bogus data once it gets into this state, so it's probably
>>> not a "simple" device coming out of reset too slowly problem. Possibly
>>> the PCIe link doesn't retrain properly sometimes after a bus reset. If
>>> a new kernel doesn't help, I could give you instructions for performing
>>> a bus reset with setpci and you could test how reliably you can reset
>>> the device and read config space after. Thanks,
>>>
>>> Alex
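
For what it's worth, the "Unknown header type 7f" line fits Alex's
explanation that config reads are coming back as all ones. The header
type byte lives at offset 0x0e of config space, lspci masks off the
multi-function bit (0x80) before interpreting it, and 0xff & 0x7f is
0x7f. The same all-ones read explains "(rev ff) (prog-if ff)". Below is
a minimal Python sketch for checking this on the host. It assumes the
usual Linux sysfs layout (/sys/bus/pci/devices/<domain:bus:dev.fn>/config);
the helper names decode_header and read_config are made up for
illustration and are not part of any tool mentioned in this thread.

```python
from pathlib import Path

# Offsets in the standard PCI configuration header.
REVISION_OFFSET = 0x08
PROG_IF_OFFSET = 0x09
HEADER_TYPE_OFFSET = 0x0E


def decode_header(cfg: bytes) -> dict:
    """Decode a few header fields from raw config-space bytes.

    Hypothetical helper: flags the all-ones pattern that a missing or
    unresponsive device produces on config reads.
    """
    if len(cfg) < 16:
        raise ValueError("need at least the first 16 bytes of config space")
    raw_type = cfg[HEADER_TYPE_OFFSET]
    return {
        # Every byte 0xff means nothing answered the config read.
        "all_ones": all(b == 0xFF for b in cfg[:64]),
        "revision": cfg[REVISION_OFFSET],
        "prog_if": cfg[PROG_IF_OFFSET],
        # Bit 7 is the multi-function flag; the low 7 bits are the layout.
        "header_type": raw_type & 0x7F,
        "multifunction": bool(raw_type & 0x80),
    }


def read_config(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> bytes:
    """Read the first 64 bytes of config space from sysfs (assumed layout)."""
    return (Path(sysfs_root) / bdf / "config").read_bytes()[:64]
```

On the host this could be run as decode_header(read_config("0000:0f:00.0")),
where the 0000 domain prefix is an assumption. A result with all_ones
set and header_type equal to 0x7f would match the lspci output quoted
above; a healthy GPU should report revision 0xa1 and header_type 0.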
_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users