Hi,

I'm seeing an intermittent problem in an Ubuntu 14.04.4 VM. It only happens when we attach two GRID K2 PCI devices: when the error occurs, the first device has the nvidia driver loaded, but the second has no driver attached to it. After a hard reboot, the VM may come up fine with the nvidia module bound to both K2 devices, or it may fail in exactly the same way; it seems to be quite random.
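
For reference, the driver state described above can be checked inside the guest with something like the sketch below. It is only an illustration, not part of our setup: the function name list_nvidia_drivers and the SYSFS_ROOT variable are made up here (SYSFS_ROOT exists only so the function can be pointed at a test tree), and it assumes the usual sysfs layout under /sys/bus/pci/devices.

```shell
#!/bin/sh
# Hypothetical helper: print every NVIDIA PCI function and the driver bound to it.
# SYSFS_ROOT is an assumption for testing; on a real system it stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

list_nvidia_drivers() {
    for path in "$SYSFS_ROOT"/bus/pci/devices/*; do
        [ -r "$path/vendor" ] || continue
        # 0x10de is NVIDIA's PCI vendor ID
        [ "$(cat "$path/vendor")" = "0x10de" ] || continue
        if [ -L "$path/driver" ]; then
            # The driver entry is a symlink whose last component is the driver name
            drv=$(basename "$(readlink "$path/driver")")
        else
            drv="(none)"
        fi
        echo "$(basename "$path") $drv"
    done
}

list_nvidia_drivers
```

In the failing case one would expect one line ending in "nvidia" and one ending in "(none)".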

We are using the cards mostly for GPU calculations and 3D visualization of scientific data, so we're not building a virtual Windows gaming PC :)

So the question is: what could cause this error? As it does not happen every time, I suspect it has something to do with the order in which the modules are loaded, but that's just a guess.

Hardware:
Fujitsu PRIMERGY CX2570 M1
2 x NVIDIA GRID K2 (4 PCI devices)
2 x Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz (HT enabled)
256GB DDR4 ECC RAM
Intel C610/X99 series chipset

Host OS: Fedora 22
Kernel: 4.3.4-200.fc22.x86_64
Qemu/KVM: qemu-kvm-2.4.1-1 (virt-preview repo)

GRUB:
GRUB_CMDLINE_LINUX="nomodeset selinux=disabled elevator=deadline rd.driver.pre=vfio-pci rd.driver.blacklist=nouveau intel_iommu=on"

/etc/libvirt/qemu.conf
user  = "qemu"
group = "qemu"
clear_emulator_capabilities = 0
dynamic_ownership = 0
cgroup_controllers = [ "cpu", "cpuacct", "cpuset" ]
max_files = 100000

cgroup_device_acl = [
"/dev/null", "/dev/full", "/dev/zero",
"/dev/random", "/dev/urandom",
"/dev/ptmx", "/dev/kvm", "/dev/kqemu",
"/dev/rtc","/dev/hpet", "/dev/vfio/vfio",
"/dev/vfio/45", "/dev/vfio/46", "/dev/vfio/58",
"/dev/vfio/59"
]

/etc/udev/rules.d/10-qemu-hw-users.rules
KERNEL=="45", SUBSYSTEM=="vfio", OWNER="qemu", GROUP="qemu", MODE="0660"
KERNEL=="46", SUBSYSTEM=="vfio", OWNER="qemu", GROUP="qemu", MODE="0660"
KERNEL=="58", SUBSYSTEM=="vfio", OWNER="qemu", GROUP="qemu", MODE="0660"
KERNEL=="59", SUBSYSTEM=="vfio", OWNER="qemu", GROUP="qemu", MODE="0660"
KERNEL=="vfio", SUBSYSTEM=="misc", OWNER="qemu", GROUP="qemu", MODE="0660"
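
The numbered /dev/vfio/N nodes in the ACL and udev rules above are IOMMU group numbers. A minimal sketch of how the group of each passed-through function could be looked up on the host follows. The names iommu_group_of, print_vfio_nodes and SYSFS_ROOT are hypothetical and only for illustration; it assumes the standard iommu_group symlink in sysfs.

```shell
#!/bin/sh
# Hypothetical helper: map PCI addresses to their /dev/vfio/<group> nodes.
# SYSFS_ROOT is an assumption for testing; on a real host it stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the IOMMU group number of one PCI device, e.g. 0000:84:00.0
iommu_group_of() {
    link="$SYSFS_ROOT/bus/pci/devices/$1/iommu_group"
    if [ ! -L "$link" ]; then
        echo "$1: no iommu_group link (IOMMU disabled or wrong address?)" >&2
        return 1
    fi
    # The symlink points at .../iommu_groups/<number>
    basename "$(readlink "$link")"
}

# Print "<address> -> /dev/vfio/<group>" for every address given
print_vfio_nodes() {
    for dev in "$@"; do
        if grp=$(iommu_group_of "$dev"); then
            echo "$dev -> /dev/vfio/$grp"
        fi
    done
}

print_vfio_nodes "$@"
```

Run with the four addresses from /etc/sysconfig/vfio-bind as arguments, the output should list the same group numbers that appear in the udev rules if those rules match the hardware.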


/etc/modprobe.d/blacklist.conf:
# disable for grid K2
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

/usr/local/bin/vfio-bind:
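#!/bin/sh
#
# Bind each PCI device given on the command line to vfio-pci.
modprobe vfio-pci
for dev in "$@"; do
    vendor=$(cat "/sys/bus/pci/devices/$dev/vendor")
    device=$(cat "/sys/bus/pci/devices/$dev/device")

    # Detach the device from its current driver, if any
    if [ -e "/sys/bus/pci/devices/$dev/driver" ]; then
        echo "$dev" > "/sys/bus/pci/devices/$dev/driver/unbind"
    fi
    # Register the vendor/device ID pair with vfio-pci
    echo "$vendor $device" > /sys/bus/pci/drivers/vfio-pci/new_id
done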

/etc/sysconfig/vfio-bind
DEVICES="0000:04:00.0 0000:05:00.0 0000:84:00.0 0000:85:00.0"
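
A minimal sketch of how one could confirm on the host that every device in DEVICES really ended up on vfio-pci after the service ran. The function name check_vfio_bound and the SYSFS_ROOT variable are hypothetical and not part of the setup above.

```shell
#!/bin/sh
# Hypothetical helper: report the driver of each given PCI device and
# return non-zero if any of them is not bound to vfio-pci.
# SYSFS_ROOT is an assumption for testing; on a real host it stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

check_vfio_bound() {
    rc=0
    for dev in "$@"; do
        link="$SYSFS_ROOT/bus/pci/devices/$dev/driver"
        if [ -L "$link" ]; then
            drv=$(basename "$(readlink "$link")")
        else
            drv="(none)"
        fi
        echo "$dev $drv"
        [ "$drv" = "vfio-pci" ] || rc=1
    done
    return $rc
}

check_vfio_bound "$@"
```

It takes the same arguments as vfio-bind, so it could be called with $DEVICES after sourcing /etc/sysconfig/vfio-bind.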

/etc/systemd/system/vfio-bind.service
[Unit]
Description=Binds devices to vfio-pci
After=syslog.target

[Service]
EnvironmentFile=-/etc/sysconfig/vfio-bind
Type=oneshot
RemainAfterExit=yes
ExecStart=-/usr/local/bin/vfio-bind $DEVICES

[Install]
WantedBy=multi-user.target

Stacktrace:
Feb 26 11:36:39 k2-test kernel: [ 1.923024] BUG: unable to handle kernel NULL pointer dereference at (null)
Feb 26 11:36:39 k2-test kernel: [ 1.923031] IP: [<ffffffff817b669c>] __down_common+0x45/0x10e
Feb 26 11:36:39 k2-test kernel: [ 1.923032] PGD 42a88c067 PUD 42a885067 PMD 0
Feb 26 11:36:39 k2-test kernel: [ 1.923034] Oops: 0002 [#1] SMP
Feb 26 11:36:39 k2-test kernel: [ 1.923042] Modules linked in: crc32_pclmul(+) ghash_clmulni_intel(-) aesni_intel aes_x86_64 ppdev lrw gf128mul glue_helper nvidia(POE+) ablk_helper cryptd serio_raw 8250_fintek parport_pc ttm drm_kms_helper drm syscopyarea sysfillrect sysimgblt mac_hid i2c_piix4 lp parport nls_utf8 isofs floppy psmouse pata_acpi
Feb 26 11:36:39 k2-test kernel: [ 1.923045] CPU: 2 PID: 560 Comm: nvidia-persiste Tainted: P OE 3.19.0-51-generic #57~14.04.1-Ubuntu
Feb 26 11:36:39 k2-test kernel: [ 1.923046] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150715_102347- 04/01/2014
Feb 26 11:36:39 k2-test kernel: [ 1.923047] task: ffff88042d0844b0 ti: ffff88042b210000 task.ti: ffff88042b210000
Feb 26 11:36:39 k2-test kernel: [ 1.923049] RIP: 0010:[<ffffffff817b669c>] [<ffffffff817b669c>] __down_common+0x45/0x10e
Feb 26 11:36:39 k2-test kernel: [ 1.923050] RSP: 0018:ffff88042b213ad8 EFLAGS: 00010096
Feb 26 11:36:39 k2-test kernel: [ 1.923050] RAX: 0000000000000000 RBX: ffffffffc1435540 RCX: ffffffffc1435548
Feb 26 11:36:39 k2-test kernel: [ 1.923051] RDX: ffff88042b213ae8 RSI: 0000000000000002 RDI: ffffffffc1435540
Feb 26 11:36:39 k2-test kernel: [ 1.923052] RBP: ffff88042b213b38 R08: 000000000001d850 R09: ffffffffc1163d4b
Feb 26 11:36:39 k2-test kernel: [ 1.923052] R10: 0000000000000020 R11: 00000000000000ff R12: 7fffffffffffffff
Feb 26 11:36:39 k2-test kernel: [ 1.923052] R13: ffff88042d0844b0 R14: 0000000000000002 R15: 0000000000000000
Feb 26 11:36:39 k2-test kernel: [ 1.923053] FS: 00007fa750b87740(0000) GS:ffff88043fc80000(0000) knlGS:0000000000000000
Feb 26 11:36:39 k2-test kernel: [ 1.923054] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 26 11:36:39 k2-test kernel: [ 1.923055] CR2: 0000000000000000 CR3: 000000042a886000 CR4: 00000000001407e0
Feb 26 11:36:39 k2-test kernel: [ 1.923058] Stack:
Feb 26 11:36:39 k2-test kernel: [ 1.923059] 0000000000000000 00000000000200da ffffffffc1435548 0000000000000000
Feb 26 11:36:39 k2-test kernel: [ 1.923061] 0000000000000000 00000000000000d0 00000000000000d0 ffffffffc1435540
Feb 26 11:36:39 k2-test kernel: [ 1.923062] ffff88042b388000 0000000000000003 ffff88042aa78a98 0000000000000002
Feb 26 11:36:39 k2-test kernel: [ 1.923062] Call Trace:
Feb 26 11:36:39 k2-test kernel: [ 1.923065] [<ffffffff817b6782>] __down+0x1d/0x1f
Feb 26 11:36:39 k2-test kernel: [ 1.923070] [<ffffffff810bb971>] down+0x41/0x50
Feb 26 11:36:39 k2-test kernel: [ 1.923142] [<ffffffffc1164087>] nvidia_open+0x3c7/0x9c0 [nvidia]
Feb 26 11:36:39 k2-test kernel: [ 1.923176] [<ffffffffc1162ded>] nvidia_frontend_open+0x4d/0xa0 [nvidia]
Feb 26 11:36:39 k2-test kernel: [ 1.923179] [<ffffffff811f117f>] chrdev_open+0x9f/0x1d0
Feb 26 11:36:39 k2-test kernel: [ 1.923181] [<ffffffff811e9c37>] do_dentry_open+0x1f7/0x340
Feb 26 11:36:39 k2-test kernel: [ 1.923182] [<ffffffff811f10e0>] ? cdev_put+0x30/0x30
Feb 26 11:36:39 k2-test kernel: [ 1.923184] [<ffffffff811eb487>] vfs_open+0x57/0x60
Feb 26 11:36:39 k2-test kernel: [ 1.923186] [<ffffffff811fb3dc>] do_last+0x4ec/0x1190
Feb 26 11:36:39 k2-test kernel: [ 1.923188] [<ffffffff811fc100>] path_openat+0x80/0x600
Feb 26 11:36:39 k2-test kernel: [ 1.923191] [<ffffffff810d629d>] ? call_rcu_sched+0x1d/0x20
Feb 26 11:36:39 k2-test kernel: [ 1.923195] [<ffffffff81075ffa>] ? release_task+0x38a/0x470
Feb 26 11:36:39 k2-test kernel: [ 1.923196] [<ffffffff811fd81a>] do_filp_open+0x3a/0x90
Feb 26 11:36:39 k2-test kernel: [ 1.923199] [<ffffffff8120a407>] ? __alloc_fd+0xa7/0x130
Feb 26 11:36:39 k2-test kernel: [ 1.923200] [<ffffffff811eb809>] do_sys_open+0x129/0x280
Feb 26 11:36:39 k2-test kernel: [ 1.923202] [<ffffffff81075b80>] ? task_stopped_code+0x60/0x60
Feb 26 11:36:39 k2-test kernel: [ 1.923203] [<ffffffff811eb97e>] SyS_open+0x1e/0x20
Feb 26 11:36:39 k2-test kernel: [ 1.923206] [<ffffffff817b874d>] system_call_fastpath+0x16/0x1b
Feb 26 11:36:39 k2-test kernel: [ 1.923216] Code: 55 65 4c 8b 2c 25 00 b9 00 00 41 54 49 89 d4 48 8d 55 b0 53 48 89 fb 48 83 ec 38 48 8b 47 10 48 89 4d b0 48 89 57 10 48 89 45 b8 <48> 89 10 48 89 f0 83 e0 01 4c 89 6d c0 c6 45 c8 00 48 89 45 a8
Feb 26 11:36:39 k2-test kernel: [ 1.923218] RIP [<ffffffff817b669c>] __down_common+0x45/0x10e
Feb 26 11:36:39 k2-test kernel: [ 1.923218] RSP <ffff88042b213ad8>
Feb 26 11:36:39 k2-test kernel: [ 1.923219] CR2: 0000000000000000

The VM has the following relevant NVIDIA (CUDA) drivers installed:

ii  cuda-nvrtc-7-5         7.5-18           amd64  NVRTC native runtime libraries
ii  cuda-nvrtc-dev-7-5     7.5-18           amd64  NVRTC native dev links, headers
ii  libxnvctrl0            352.79-0ubuntu1  amd64  NV-CONTROL X extension (runtime library)
ii  nvidia-352             352.79-0ubuntu1  amd64  NVIDIA binary driver - version 352.79
ii  nvidia-352-dev         352.79-0ubuntu1  amd64  NVIDIA binary Xorg driver development files
ii  nvidia-352-uvm         352.79-0ubuntu1  amd64  Transitional package for nvidia-352
ii  nvidia-modprobe        352.79-0ubuntu1  amd64  Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-352  352.79-0ubuntu1  amd64  NVIDIA OpenCL ICD
ii  nvidia-settings        352.79-0ubuntu1  amd64  Tool for configuring the NVIDIA graphics driver

VM kernel:
uname -a
Linux k2-test kernel 3.19.0-51-generic #57~14.04.1-Ubuntu SMP Fri Feb 19 14:36:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

XML VM:
<domain type='kvm' id='8'>
  <name>one-221</name>
  <uuid>0f415850-451e-465b-8ad4-bb6cd84209d2</uuid>
  <metadata>
    <system_datastore>/var/lib/one//datastores/109/221</system_datastore>
  </metadata>
  <memory unit='KiB'>16777216</memory>
  <currentMemory unit='KiB'>16777216</currentMemory>
  <vcpu placement='static'>8</vcpu>
  <cputune>
    <shares>8192</shares>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.4'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
  </features>
  <cpu mode='host-passthrough'/>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/var/lib/one//datastores/109/221/disk.0'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/one//datastores/109/221/disk.1'/>
      <backingStore/>
      <target dev='hda' bus='ide'/>
      <readonly/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='ide' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='bridge'>
      <mac address='04:09:92:65:3b:1d'/>
      <source bridge='ovsbridge0'/>
      <virtualport type='openvswitch'>
        <parameters interfaceid='0576023f-b955-4d46-8129-bbcb5e26dfa2'/>
      </virtualport>
      <target dev='vnet1'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='6121' autoport='no' listen='0.0.0.0'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='cirrus' vram='16384' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x84' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x85' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </memballoon>
  </devices>
</domain>



--
Kind regards,

Martijn Kint
Systeem Expert Big Data Services & HPC Cloud
e-mail: [email protected] | M: +31 6 16 38 64 69
SURFsara | Science Park 140 | 1098 XG Amsterdam

_______________________________________________
vfio-users mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/vfio-users
