Dear All,

I'm still having the same problems. Is this a bug, or something that's configured incorrectly?
Regards,
Callum

--
Callum Smith
Research Computing Core
Wellcome Trust Centre for Human Genetics
University of Oxford
e. cal...@well.ox.ac.uk

On 18 May 2018, at 13:22, Callum Smith <cal...@well.ox.ac.uk> wrote:

Yep, creating the mdev manually works. In fact, as I said previously, the VM does create an mdev successfully: you can see the UUID of the device, and it is correctly identifiable through

/sys/class/mdev_bus/${DEVICE_ADDR}/${UUID}/mdev_type/name

To help with the logs: in this specific case the generated UUID is consistently the same (even after manual deletion), "f5dc8396-dad5-3893-9eb4-94eedf60a881".

The VM then fails to start because of the MTU issue. Restarting the VM on the node then produces the "device not available" issue, because the device with the previous UUID still exists and the type has max_instance=1. So it's the first VM start, with the MTU issue, that needs resolving, with the added complication that the MTU (network) issue is caused by the mdev being set; the same error does not happen when no mdev is set.

PS. This was in fact the guide I followed, so thank you Martin for writing it; without it, getting this far would have been very difficult:
https://mpolednik.github.io/2017/09/13/vgpu-in-ovirt/

Regards,
Callum

On 18 May 2018, at 13:05, Martin Polednik <mpoled...@redhat.com> wrote:

On 18/05/18 13:42 +0200, Francesco Romani wrote:

Hi,

On 05/17/2018 10:56 AM, Callum Smith wrote:

In an attempt not to mislead you guys as well, there appears to be a separate, vGPU-specific issue.

https://www.dropbox.com/s/hlymmf9d6rn12tq/vdsm.vgpu.log?dl=0

I've uploaded the full vdsm.log to Dropbox.
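For reference, the leftover instance described above can be inspected and cleared from sysfs. A minimal sketch, assuming a placeholder PCI address (`0000:3e:00.0` here; substitute your card's address) and the UUID quoted in this thread:

```shell
#!/bin/sh
# Inspect a possibly stale mdev instance via sysfs and, if present, remove it.
# The PCI address below is a placeholder; the UUID is the one from this thread.
check_stale_mdev() {
    addr="$1"
    uuid="$2"
    mdev="/sys/class/mdev_bus/${addr}/${uuid}"
    if [ -d "$mdev" ]; then
        # Which vGPU type the leftover instance has (e.g. nvidia-61).
        cat "${mdev}/mdev_type/name"
        # Removing it frees the slot again, which matters when max_instance=1.
        echo 1 > "${mdev}/remove"
        echo "removed ${uuid}"
    else
        echo "no stale mdev ${uuid}"
    fi
}

check_stale_mdev "0000:3e:00.0" "f5dc8396-dad5-3893-9eb4-94eedf60a881"
```

On a host without an mdev-capable device (or without the stale instance), the script just reports that and touches nothing.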
Most recently I tried unmounting all network devices from the VM and booting it, and I get a different issue around the vGPU:

2018-05-17 09:48:24,806+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=0 err= (hooks:110)
2018-05-17 09:48:24,953+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=1 err=vgpu: No device with type nvidia-61 is available. (hooks:110)
2018-05-17 09:48:25,069+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vhostmd: rc=0 err= (hooks:110)
2018-05-17 09:48:25,070+0100 ERROR (vm/1bc9dae8) [virt.vm] (vmId='1bc9dae8-a0ea-44b3-9103-5805100648d0') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2862, in _run
    self._custom)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hooks.py", line 153, in before_vm_start
    return _runHooksDir(domxml, 'before_vm_start', vmconf=vmconf)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hooks.py", line 120, in _runHooksDir
    raise exception.HookError(err)
HookError: Hook Error: ('',)

This is despite nvidia-61 being listed as an option on the GPU: https://pastebin.com/bucw21DG

Let's tackle one issue at a time :)

From the shared logs, the VM start failed because of:

2018-05-17 10:11:12,681+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=0 err= (hooks:110)
2018-05-17 10:11:12,837+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=1 err=vgpu: No device with type nvidia-53 is available.

Maybe Martin can shed some light here?

Given that the actual slice is available in sysfs (as indicated by one of the other branches of this thread), I fear we may be facing some weird issue with the driver itself. Can you create the mdev manually?
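Before creating one manually, it is worth cross-checking what the kernel currently reports for the card, since the vfio_mdev hook's "No device with type … is available" corresponds to available_instances having dropped to 0. A sketch, again with a placeholder PCI address:

```shell
#!/bin/sh
# List every mdev type a device supports, with its human-readable name and
# how many more instances of it can currently be created.
# The PCI address passed in below is a placeholder; adjust for your host.
list_mdev_types() {
    dir="/sys/class/mdev_bus/$1/mdev_supported_types"
    if [ -d "$dir" ]; then
        for t in "$dir"/*; do
            printf '%s name=%s available_instances=%s\n' \
                "$(basename "$t")" \
                "$(cat "$t/name")" \
                "$(cat "$t/available_instances")"
        done
    else
        echo "no mdev-capable device at $1"
    fi
}

list_mdev_types "0000:3e:00.0"
```

If nvidia-61 shows available_instances=0 here while no VM is running, a stale instance (or another type on the same physical GPU holding the slot) is the likely cause.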
$ uuidgen > /sys/class/mdev_bus/${DEVICE_ADDR}/mdev_supported_types/nvidia-61/create

should be enough for a test.

Callum, please share Vdsm logs showing the network failure.

Bests,

--
Francesco Romani
Senior SW Eng., Virtualization R&D
Red Hat
IRC: fromani github: @fromanirh

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
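For completeness, that manual test can be wrapped with a verification step. A sketch under the same assumptions (placeholder PCI address; the create node only exists on a host with the vendor vGPU driver loaded):

```shell
#!/bin/sh
# Manually create an mdev of a given type and verify it took effect by
# reading the new instance's type name back from sysfs.
# The PCI address used below is a placeholder; adjust for your host.
create_mdev() {
    addr="$1"
    type="$2"
    create="/sys/class/mdev_bus/${addr}/mdev_supported_types/${type}/create"
    if [ -w "$create" ]; then
        uuid=$(uuidgen)
        echo "$uuid" > "$create"
        # On success this prints the type name back (e.g. for nvidia-61,
        # the marketing name of the vGPU profile).
        cat "/sys/class/mdev_bus/${addr}/${uuid}/mdev_type/name"
    else
        echo "create node for ${type} not present at ${addr}"
    fi
}

create_mdev "0000:3e:00.0" "nvidia-61"
```

If this succeeds by hand but the hook still fails, that points at ordering or cleanup (a previously created instance holding the only slot) rather than the driver's create path itself.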