[ovirt-users] Re: [EXTERNAL] Re: Storage Domain won't activate

2020-09-04 Thread Gillingham, Eric J (US 393D) via Users
On 9/4/20, 2:26 PM, "Nir Soffer"  wrote:
On Fri, Sep 4, 2020 at 5:43 PM Gillingham, Eric J (US 393D) via Users
 wrote:
>
> On 9/4/20, 4:50 AM, "Vojtech Juranek"  wrote:
>
> On Thursday 3 September 2020 22:49:17 CEST Gillingham, Eric J (US 393D)
> via Users wrote:
>
> how did you remove the first host, did you put it into maintenance first? I
> wonder how this situation (two lockspaces with conflicting names) can occur.
>
> You can try to re-initialize the lockspace directly using the sanlock command (see
> man sanlock), but it would be good to understand the situation first.
>
>
> Just as you said, put into maintenance mode, shut it down, removed it via 
the engine UI.

Eric, is it possible that you shut down the host too quickly, before it actually
disconnected from the lockspace?

When the engine moves a host to maintenance, it does not wait until the host has
actually moved into maintenance. This is actually a bug, so it would be a good
idea to file one.


That is a possibility. From the UI view it usually takes a bit for the host to 
show as in maintenance, so I assumed it was an accurate representation of the 
state. Unfortunately all hosts have since been completely wiped and 
re-installed; this issue brought down the entire cluster for over a day, so I 
needed to get everything up again ASAP.

I did not archive/back up the sanlock logs beforehand, so I can't check for the 
sanlock events David mentioned. When I cleared the sanlock state there were no s or r 
entries listed in sanlock client status, and there were no other running hosts 
to hold other locks, but I don't grok sanlock well enough to say whether some 
lock existed only on the iSCSI storage, separate from any current or past 
hosts.
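
For what it's worth, the on-disk delta leases can still be inspected even with no 
hosts running; a minimal sketch, using the domain UUID from the logs (the device 
path may differ on other setups):

    # dump the delta leases stored on the ids volume of the storage domain
    sanlock direct dump /dev/e1270474-108c-4cae-83d6-51698cffebbf/ids

This shows which host ids hold leases on the storage itself, independent of any 
current or past host (see man sanlock).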


___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/O7LLWCIC76RPOXA4DCE2NTPWAZEBE6FK/


[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Arman Khalatyan
same here ☺️, will check them on Monday.

Michael Jones  wrote on Fri., 4 Sept. 2020, 22:01:

> Yeah, passthrough. I think for vGPU you have to pay NVIDIA for a driver
> upgrade; I've not tried that and don't know the price, and didn't find it
> easy to get info on it last time I tried.
>
> Have used it in both legacy and UEFI boot machines; don't know the chipsets
> off the top of my head, will look on Monday.
>
>
> On Fri, 4 Sep 2020, 20:56 Vinícius Ferrão, 
> wrote:
>
>> Thanks Michael and Arman.
>>
>> To make things clear, you guys are using Passthrough, right? It’s not
>> vGPU. The 4x GPUs are added on the “Host Devices” tab of the VM.
>> What I’m trying to achieve is add the 4x V100 directly to one specific VM.
>>
>> And finally can you guys confirm which BIOS type is being used in your
>> machines? I’m with Q35 Chipset with UEFI BIOS. I haven’t tested it with
>> legacy, perhaps I’ll give it a try.
>>
>> Thanks again.
>>
>> On 4 Sep 2020, at 14:09, Michael Jones  wrote:
>>
>> Also use multiple t4, also p4, titans, no issues but never used the nvlink
>>
>> On Fri, 4 Sep 2020, 16:02 Arman Khalatyan,  wrote:
>>
>>> hi,
>>> with the 2xT4 we haven't seen any trouble. we have no nvlink there.
>>>
>>> did u try to disable the nvlink?
>>>
>>>
>>>
>>> Vinícius Ferrão via Users  wrote on Fri., 4 Sept.
>>> 2020, 08:39:
>>>
 Hello, here we go again.

 I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a
 single VM; but things aren’t that good. Only one GPU shows up on the VM.
 lspci is able to show the GPUs, but three of them are unusable:

 08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)
 09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)
 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)
 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)

 There are some errors on dmesg, regarding a misconfigured BIOS:

 [   27.295972] nvidia: loading out-of-tree module taints kernel.
 [   27.295980] nvidia: module license 'NVIDIA' taints kernel.
 [   27.295981] Disabling lock debugging due to kernel taint
 [   27.304180] nvidia: module verification failed: signature and/or
 required key missing - tainting kernel
 [   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major
 device number 241
 [   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
 [   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device
 is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
 [   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
 [   27.579566] nvidia: probe of :09:00.0 failed with error -1
 [   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device
 is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
 [   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
 [   27.580734] nvidia: probe of :0a:00.0 failed with error -1
 [   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device
 is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
 [   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
 [   27.581305] nvidia: probe of :0b:00.0 failed with error -1
 [   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
 [   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module
 450.51.06  Sun Jul 19 20:02:54 UTC 2020
 [   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting
 Driver for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020

 The host is Secure Intel Skylake (x86_64). VM is running with Q35
 Chipset with UEFI (pc-q35-rhel8.2.0)

 I’ve tried to change the I/O mapping options on the host, tried with
 56TB and 12TB without success. Same results. Didn’t try with 512GB since
 the machine has 768GB of system RAM.

 Tried blacklisting the nouveau on the host, nothing.
 Installed NVIDIA drivers on the host, nothing.

 On the host I can use the 4x V100, but inside a single VM it’s
 impossible.

 Any suggestions?



 ___
 Users mailing list -- users@ovirt.org
 To unsubscribe send an email to users-le...@ovirt.org
 Privacy Statement: https://www.ovirt.org/privacy-policy.html
 oVirt Code of Conduct:
 https://www.ovirt.org/community/about/community-guidelines/
 List Archives:
 https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/


[ovirt-users] Re: [EXTERNAL] Re: Storage Domain won't activate

2020-09-04 Thread David Teigland
On Sat, Sep 05, 2020 at 12:25:45AM +0300, Nir Soffer wrote:
> > > /var/log/sanlock.log contains a repeating:
> > > add_lockspace
> > > e1270474-108c-4cae-83d6-51698cffebbf:1:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
> > > conflicts with name of list1 s1
> > > e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
> 
> David, what does this message mean?
> 
> It is clear that there is a conflict, but not clear what is the
> conflicting item. The host id in the
> request is 1, and in the conflicting item, 3. No conflicting data is
> displayed in the error message.

The lockspace being added is already being managed by sanlock, but using
host_id 3.  sanlock.log should show when lockspace e1270474 with host_id 3
was added.
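
To pin that down, the host's sanlock log can be checked directly; a sketch, using 
the lockspace UUID from the message above:

    # look for the add_lockspace entries to see when the lockspace was added
    # and with which host_id
    grep e1270474 /var/log/sanlock.log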

Dave
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/47LHIPALKTJE4FG3OBNCN23H7SPUSQYE/


[ovirt-users] Re: [EXTERNAL] Re: Storage Domain won't activate

2020-09-04 Thread Nir Soffer
On Fri, Sep 4, 2020 at 5:43 PM Gillingham, Eric J (US 393D) via Users
 wrote:
>
> On 9/4/20, 4:50 AM, "Vojtech Juranek"  wrote:
>
> On Thursday 3 September 2020 22:49:17 CEST Gillingham, Eric J (US 393D) via
> Users wrote:
> > I recently removed a host from my cluster to upgrade it to 4.4, after I
> > removed the host from the datacenter VMs started to pause on the second
> > system they all migrated to. Investigating via the engine showed the
> > storage domain was showing as "unknown", when I try to activate it via 
> the
> > engine it cycles to locked then to unknown again.
>
> > /var/log/sanlock.log contains a repeating:
> > add_lockspace
> > e1270474-108c-4cae-83d6-51698cffebbf:1:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
> > conflicts with name of list1 s1
> > e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0

David, what does this message mean?

It is clear that there is a conflict, but not clear what is the
conflicting item. The host id in the
request is 1, and in the conflicting item, 3. No conflicting data is
displayed in the error message.

> how did you remove the first host, did you put it into maintenance first? I
> wonder how this situation (two lockspaces with conflicting names) can occur.
>
> You can try to re-initialize the lockspace directly using the sanlock command
> (see man sanlock), but it would be good to understand the situation first.
>
>
> Just as you said, put into maintenance mode, shut it down, removed it via the 
> engine UI.

Eric, is it possible that you shut down the host too quickly, before it actually
disconnected from the lockspace?

When the engine moves a host to maintenance, it does not wait until the host has
actually moved into maintenance. This is actually a bug, so it would be a good
idea to file one.
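
Until then, a practical workaround is to confirm on the host, after it reports
maintenance and before powering it off, that the lockspace was actually released;
a minimal check (sketch):

    # there should be no remaining "s <storage-domain-uuid>..." line for the domain
    sanlock client status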

Nir
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/DF5JRTXDQOHVTVNQ7BT3SF564GSQA4ZX/


[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Michael Jones
First things I'd check would be what driver is on the host, and that it's the
NVIDIA driver all the way through; make sure nouveau is blacklisted throughout.
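
For reference, the usual way to do that (a sketch of the standard approach, not
oVirt-specific; the file name is just a convention):

    # /etc/modprobe.d/blacklist-nouveau.conf
    blacklist nouveau
    options nouveau modeset=0

then rebuild the initramfs (e.g. dracut -f) and reboot the host, so nouveau can
never bind to the GPUs before vfio-pci or the NVIDIA driver does.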

On Fri, 4 Sep 2020, 21:01 Michael Jones,  wrote:

> Yeah, passthrough. I think for vGPU you have to pay NVIDIA for a driver
> upgrade; I've not tried that and don't know the price, and didn't find it
> easy to get info on it last time I tried.
>
> Have used it in both legacy and UEFI boot machines; don't know the chipsets
> off the top of my head, will look on Monday.
>
>
> On Fri, 4 Sep 2020, 20:56 Vinícius Ferrão, 
> wrote:
>
>> Thanks Michael and Arman.
>>
>> To make things clear, you guys are using Passthrough, right? It’s not
>> vGPU. The 4x GPUs are added on the “Host Devices” tab of the VM.
>> What I’m trying to achieve is add the 4x V100 directly to one specific VM.
>>
>> And finally can you guys confirm which BIOS type is being used in your
>> machines? I’m with Q35 Chipset with UEFI BIOS. I haven’t tested it with
>> legacy, perhaps I’ll give it a try.
>>
>> Thanks again.
>>
>> On 4 Sep 2020, at 14:09, Michael Jones  wrote:
>>
>> Also use multiple t4, also p4, titans, no issues but never used the nvlink
>>
>> On Fri, 4 Sep 2020, 16:02 Arman Khalatyan,  wrote:
>>
>>> hi,
>>> with the 2xT4 we haven't seen any trouble. we have no nvlink there.
>>>
>>> did u try to disable the nvlink?
>>>
>>>
>>>
>>> Vinícius Ferrão via Users  wrote on Fri., 4 Sept.
>>> 2020, 08:39:
>>>
 Hello, here we go again.

 I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a
 single VM; but things aren’t that good. Only one GPU shows up on the VM.
 lspci is able to show the GPUs, but three of them are unusable:

 08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)
 09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)
 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)
 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)

 There are some errors on dmesg, regarding a misconfigured BIOS:

 [   27.295972] nvidia: loading out-of-tree module taints kernel.
 [   27.295980] nvidia: module license 'NVIDIA' taints kernel.
 [   27.295981] Disabling lock debugging due to kernel taint
 [   27.304180] nvidia: module verification failed: signature and/or
 required key missing - tainting kernel
 [   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major
 device number 241
 [   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
 [   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device
 is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
 [   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
 [   27.579566] nvidia: probe of :09:00.0 failed with error -1
 [   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device
 is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
 [   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
 [   27.580734] nvidia: probe of :0a:00.0 failed with error -1
 [   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device
 is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
 [   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
 [   27.581305] nvidia: probe of :0b:00.0 failed with error -1
 [   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
 [   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module
 450.51.06  Sun Jul 19 20:02:54 UTC 2020
 [   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting
 Driver for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020

 The host is Secure Intel Skylake (x86_64). VM is running with Q35
 Chipset with UEFI (pc-q35-rhel8.2.0)

 I’ve tried to change the I/O mapping options on the host, tried with
 56TB and 12TB without success. Same results. Didn’t try with 512GB since
 the machine has 768GB of system RAM.

 Tried blacklisting the nouveau on the host, nothing.
 Installed NVIDIA drivers on the host, nothing.

 On the host I can use the 4x V100, but inside a single VM it’s
 impossible.

 Any suggestions?



 ___
 Users mailing list -- users@ovirt.org
 To unsubscribe send an email to users-le...@ovirt.org
 Privacy Statement: https://www.ovirt.org/privacy-policy.html
 oVirt Code of Conduct:
 https://www.ovirt.org/community/about/community-guidelines/
 List Archives:
 https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/


[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Michael Jones
Yeah, passthrough. I think for vGPU you have to pay NVIDIA for a driver
upgrade; I've not tried that and don't know the price, and didn't find it easy
to get info on it last time I tried.

Have used it in both legacy and UEFI boot machines; don't know the chipsets
off the top of my head, will look on Monday.


On Fri, 4 Sep 2020, 20:56 Vinícius Ferrão, 
wrote:

> Thanks Michael and Arman.
>
> To make things clear, you guys are using Passthrough, right? It’s not
> vGPU. The 4x GPUs are added on the “Host Devices” tab of the VM.
> What I’m trying to achieve is add the 4x V100 directly to one specific VM.
>
> And finally can you guys confirm which BIOS type is being used in your
> machines? I’m with Q35 Chipset with UEFI BIOS. I haven’t tested it with
> legacy, perhaps I’ll give it a try.
>
> Thanks again.
>
> On 4 Sep 2020, at 14:09, Michael Jones  wrote:
>
> Also use multiple t4, also p4, titans, no issues but never used the nvlink
>
> On Fri, 4 Sep 2020, 16:02 Arman Khalatyan,  wrote:
>
>> hi,
>> with the 2xT4 we haven't seen any trouble. we have no nvlink there.
>>
>> did u try to disable the nvlink?
>>
>>
>>
>> Vinícius Ferrão via Users  wrote on Fri., 4 Sept.
>> 2020, 08:39:
>>
>>> Hello, here we go again.
>>>
>>> I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a
>>> single VM; but things aren’t that good. Only one GPU shows up on the VM.
>>> lspci is able to show the GPUs, but three of them are unusable:
>>>
>>> 08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>>> (rev a1)
>>> 09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>>> (rev a1)
>>> 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>>> (rev a1)
>>> 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>>> (rev a1)
>>>
>>> There are some errors on dmesg, regarding a misconfigured BIOS:
>>>
>>> [   27.295972] nvidia: loading out-of-tree module taints kernel.
>>> [   27.295980] nvidia: module license 'NVIDIA' taints kernel.
>>> [   27.295981] Disabling lock debugging due to kernel taint
>>> [   27.304180] nvidia: module verification failed: signature and/or
>>> required key missing - tainting kernel
>>> [   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major
>>> device number 241
>>> [   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
>>> [   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device
>>> is invalid:
>>>NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
>>> [   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
>>> [   27.579566] nvidia: probe of :09:00.0 failed with error -1
>>> [   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device
>>> is invalid:
>>>NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
>>> [   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
>>> [   27.580734] nvidia: probe of :0a:00.0 failed with error -1
>>> [   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device
>>> is invalid:
>>>NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
>>> [   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
>>> [   27.581305] nvidia: probe of :0b:00.0 failed with error -1
>>> [   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
>>> [   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module
>>> 450.51.06  Sun Jul 19 20:02:54 UTC 2020
>>> [   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver
>>> for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020
>>>
>>> The host is Secure Intel Skylake (x86_64). VM is running with Q35
>>> Chipset with UEFI (pc-q35-rhel8.2.0)
>>>
>>> I’ve tried to change the I/O mapping options on the host, tried with
>>> 56TB and 12TB without success. Same results. Didn’t try with 512GB since
>>> the machine has 768GB of system RAM.
>>>
>>> Tried blacklisting the nouveau on the host, nothing.
>>> Installed NVIDIA drivers on the host, nothing.
>>>
>>> On the host I can use the 4x V100, but inside a single VM it’s
>>> impossible.
>>>
>>> Any suggestions?
>>>
>>>
>>>
>>> ___
>>> Users mailing list -- users@ovirt.org
>>> To unsubscribe send an email to users-le...@ovirt.org
>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>> oVirt Code of Conduct:
>>> https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/
>>>
>> ___
>> Users mailing list -- users@ovirt.org
>> To unsubscribe send an email to users-le...@ovirt.org
>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>> oVirt Code of Conduct:
>> https://www.ovirt.org/community/about/community-guidelines/
>> List Archives:
>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/PIO4DIVUU4JWG5FXYW3NQSVXCFZWYV26/
>>
>
>

[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Vinícius Ferrão via Users
Thanks Michael and Arman.

To make things clear, you guys are using passthrough, right? It’s not vGPU. The 
4x GPUs are added on the “Host Devices” tab of the VM.
What I’m trying to achieve is to add the 4x V100s directly to one specific VM.

And finally, can you guys confirm which BIOS type is being used in your 
machines? I’m on the Q35 chipset with UEFI BIOS. I haven’t tested it with 
legacy; perhaps I’ll give it a try.

Thanks again.

On 4 Sep 2020, at 14:09, Michael Jones <m...@mikejonesey.co.uk> wrote:

Also use multiple t4, also p4, titans, no issues but never used the nvlink

On Fri, 4 Sep 2020, 16:02 Arman Khalatyan, <arm2...@gmail.com> wrote:
hi,
with the 2xT4 we haven't seen any trouble. we have no nvlink there.

did u try to disable the nvlink?



Vinícius Ferrão via Users <users@ovirt.org> wrote on 
Fri., 4 Sept. 2020, 08:39:
Hello, here we go again.

I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a single 
VM; but things aren’t that good. Only one GPU shows up on the VM. lspci is able 
to show the GPUs, but three of them are unusable:

08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev 
a1)
09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev 
a1)
0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev 
a1)
0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev 
a1)

There are some errors on dmesg, regarding a misconfigured BIOS:

[   27.295972] nvidia: loading out-of-tree module taints kernel.
[   27.295980] nvidia: module license 'NVIDIA' taints kernel.
[   27.295981] Disabling lock debugging due to kernel taint
[   27.304180] nvidia: module verification failed: signature and/or required 
key missing - tainting kernel
[   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major device 
number 241
[   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
[   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device is 
invalid:
   NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
[   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
[   27.579566] nvidia: probe of :09:00.0 failed with error -1
[   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device is 
invalid:
   NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
[   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
[   27.580734] nvidia: probe of :0a:00.0 failed with error -1
[   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device is 
invalid:
   NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
[   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
[   27.581305] nvidia: probe of :0b:00.0 failed with error -1
[   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
[   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.51.06  Sun 
Jul 19 20:02:54 UTC 2020
[   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for 
UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020

The host is Secure Intel Skylake (x86_64). VM is running with Q35 Chipset with 
UEFI (pc-q35-rhel8.2.0)

I’ve tried to change the I/O mapping options on the host, tried with 56TB and 
12TB without success. Same results. Didn’t try with 512GB since the machine 
has 768GB of system RAM.

Tried blacklisting the nouveau on the host, nothing.
Installed NVIDIA drivers on the host, nothing.

On the host I can use the 4x V100, but inside a single VM it’s impossible.

Any suggestions?



___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to 
users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to 
users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/PIO4DIVUU4JWG5FXYW3NQSVXCFZWYV26/

___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/FY5J2VGAZXUOE3K5QJIS3ETXP76M3CHO/


[ovirt-users] Re: [EXTERNAL] Re: Storage Domain won't activate

2020-09-04 Thread Gillingham, Eric J (US 393D) via Users
This is using iSCSI storage. I stopped the oVirt broker/agents/vdsm and used 
sanlock to remove the locks it was complaining about, but as soon as I started 
the oVirt tools up and the engine came online again the same messages 
reappeared.

After spending more than a day trying to resolve this nicely I gave up. I 
installed ovirt-node on the host I originally removed, added that to the 
cluster, then removed and nuked the misbehaving host and did a clean install 
there. I did run into an issue where the first host had an empty 
hosted-engine.conf (it only had the cert and the id settings in it), so it wouldn’t 
connect properly, but I worked around that by copying the fully populated 
one from the semi-working host and changing the id to match.
No idea if this is the right solution, but it _seems_ to be working and my VMs 
are running again; I just got too frustrated trying to debug through normal 
methods and the solutions offered via the oVirt tools and documentation.
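
For anyone searching later, the workaround above amounts to something like this
(a sketch only, assuming the default config location; the host id must be unique
per host in the cluster):

    # copy the populated config from the working host ("working-host" is a placeholder)
    scp working-host:/etc/ovirt-hosted-engine/hosted-engine.conf /etc/ovirt-hosted-engine/
    # then edit host_id= in the copied file so it matches this host's id
    vi /etc/ovirt-hosted-engine/hosted-engine.conf

No guarantee this is the supported way to repair it, but it matches what worked
here.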

- Eric

On 9/4/20, 10:59 AM, "Strahil Nikolov"  wrote:

Is this a HCI setup ?
If yes, check gluster status (I prefer cli but is also valid in the UI).

gluster pool list
gluster volume status

gluster volume heal  info summary

Best Regards,
Strahil Nikolov






On Friday, 4 September 2020, 00:38:13 GMT+3, Gillingham, Eric J (US 
393D) via Users  wrote: 





I recently removed a host from my cluster to upgrade it to 4.4. After I 
removed the host from the datacenter, VMs started to pause on the second system 
they had all migrated to. Investigating via the engine showed the storage domain 
as "unknown"; when I try to activate it via the engine it cycles to 
locked and then back to unknown.

/var/log/sanlock.log contains a repeating:
add_lockspace 
e1270474-108c-4cae-83d6-51698cffebbf:1:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
 conflicts with name of list1 s1 
e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0


vdsm.log contains these (maybe related) snippets:
---
2020-09-03 20:19:53,483+ INFO  (jsonrpc/6) [vdsm.api] FINISH 
getAllTasksStatuses error=Secured object is not in safe state 
from=:::137.79.52.43,36326, flow_id=18031a91, 
task_id=8e92f059-743a-48c8-aa9d-e7c4c836337b (api:52)
2020-09-03 20:19:53,483+ ERROR (jsonrpc/6) [storage.TaskManager.Task] 
(Task='8e92f059-743a-48c8-aa9d-e7c4c836337b') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, 
in _run
return fn(*args, **kargs)
  File "", line 2, in getAllTasksStatuses
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in 
method
ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2201, 
in getAllTasksStatuses
allTasksStatus = self._pool.getAllTasksStatuses()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 
77, in wrapper
raise SecureError("Secured object is not in safe state")
SecureError: Secured object is not in safe state
2020-09-03 20:19:53,483+ INFO  (jsonrpc/6) [storage.TaskManager.Task] 
(Task='8e92f059-743a-48c8-aa9d-e7c4c836337b') aborting: Task is aborted: 
u'Secured object is not in safe state' - code 100 (task:1181)
2020-09-03 20:19:53,483+ ERROR (jsonrpc/6) [storage.Dispatcher] FINISH 
getAllTasksStatuses error=Secured object is not in safe state (dispatcher:87)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 
74, in wrapper
result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, 
in wrapper
return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, 
in prepare
raise self.error
SecureError: Secured object is not in safe state
---
2020-09-03 20:44:23,252+ INFO  (tasks/2) 
[storage.ThreadPool.WorkerThread] START task 
76415a77-9d29-4b72-ade1-53207cfc503b (cmd=>, args=None) (thre
adPool:208)
2020-09-03 20:44:23,266+ INFO  (tasks/2) [storage.SANLock] Acquiring 
host id for domain e1270474-108c-4cae-83d6-51698cffebbf (id=1, wait=True) 
(clusterlock:313)
2020-09-03 20:44:23,267+ ERROR (tasks/2) [storage.TaskManager.Task] 
(Task='76415a77-9d29-4b72-ade1-53207cfc503b') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, 
in _run
return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, 
in run
return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 317, in 
startSpm
self.masterDomain.acquireHostId(self.id)
  File 

[ovirt-users] Re: VM HostedEngine is down with error

2020-09-04 Thread Strahil Nikolov via Users
Hi Maria,

I am quite puzzled about:

>/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x555e3137101b] ) 0-: received 
>signum (15), shutting down
[2020-08-27 15:35:14.890471] E [MSGID: 100018] 
[glusterfsd.c:2333:glusterfs_pidfile_update] 0-glusterfsd: pidfile 
/var/run/gluster/vols/data/ov-no1.a
riadne-t.local-gluster_bricks-data-data.pid lock failed [Resource temporarily 
unavailable]

That doesn't make sense.

Can you share the logs in a separate thread at gluster-us...@gluster.org?


Best Regards,
Strahil Nikolov




On Friday, 4 September 2020, 19:55:24 GMT+3, souvaliotima...@mail.com 
 wrote: 





Hello, 

This is what I could gather from the gluster logs around the time frame of the 
HE shutdown.

NODE1:
[root@ov-no1 glusterfs]# more 
bricks/gluster_bricks-vmstore-vmstore.log-20200830 |egrep "( W | E )"|more
[2020-08-27 15:35:03.090477] W [glusterfsd.c:1570:cleanup_and_exit] 
(-->/lib64/libpthread.so.0(+0x7dd5) [0x7fa6e04a3dd5] 
-->/usr/sbin/glusterfsd(glus
terfs_sigwaiter+0xe5) [0x55a40138d1b5] 
-->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x55a40138d01b] ) 0-: received 
signum (15), shutting down
[2020-08-27 15:35:14.926794] E [MSGID: 100018] 
[glusterfsd.c:2333:glusterfs_pidfile_update] 0-glusterfsd: pidfile 
/var/run/gluster/vols/vmstore/ov-no
1.ariadne-t.local-gluster_bricks-vmstore-vmstore.pid lock failed [Resource 
temporarily unavailable]


[root@ov-no1 glusterfs]# more bricks/gluster_bricks-data-data.log-20200830 
|egrep "( W | E )"|more
[2020-08-27 15:35:01.087875] W [glusterfsd.c:1570:cleanup_and_exit] 
(-->/lib64/libpthread.so.0(+0x7dd5) [0x7fc3cbf69dd5] 
-->/usr/sbin/glusterfsd(glus
terfs_sigwaiter+0xe5) [0x555e313711b5] 
-->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x555e3137101b] ) 0-: received 
signum (15), shutting down
[2020-08-27 15:35:14.890471] E [MSGID: 100018] 
[glusterfsd.c:2333:glusterfs_pidfile_update] 0-glusterfsd: pidfile 
/var/run/gluster/vols/data/ov-no1.a
riadne-t.local-gluster_bricks-data-data.pid lock failed [Resource temporarily 
unavailable]


[root@ov-no1 glusterfs]# more bricks/gluster_bricks-engine-engine.log-20200830 
|egrep "( W | E )"|more
[2020-08-27 15:35:02.088732] W [glusterfsd.c:1570:cleanup_and_exit] 
(-->/lib64/libpthread.so.0(+0x7dd5) [0x7f70b99cbdd5] 
-->/usr/sbin/glusterfsd(glus
terfs_sigwaiter+0xe5) [0x55ebd132b1b5] 
-->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x55ebd132b01b] ) 0-: received 
signum (15), shutting down
[2020-08-27 15:35:14.907603] E [MSGID: 100018] 
[glusterfsd.c:2333:glusterfs_pidfile_update] 0-glusterfsd: pidfile 
/var/run/gluster/vols/engine/ov-no1
.ariadne-t.local-gluster_bricks-engine-engine.pid lock failed [Resource 
temporarily unavailable]


[root@ov-no1 glusterfs]# more bricks/gluster_bricks-vmstore-vmstore.log |egrep 
"( W | E )"|more
[nothing in the output]

[root@ov-no1 glusterfs]# more bricks/gluster_bricks-data-data.log |egrep "( W | 
E )"|more
[nothing in the output]

[root@ov-no1 glusterfs]# more bricks/gluster_bricks-engine-engine.log |egrep "( 
W | E )"|more
[nothing in the output]


[root@ov-no1 glusterfs]# more cmd_history.log | egrep "(WARN|error|fail)" |more
[2020-09-01 02:00:38.685251]  : volume geo-replication status : FAILED : Commit 
failed on ov-no2.ariadne-t.local. Please check log file for details.
Commit failed on ov-no3.ariadne-t.local. Please check log file for details.
[2020-09-01 03:02:39.094984]  : volume geo-replication status : FAILED : Commit 
failed on ov-no2.ariadne-t.local. Please check log file for details.
Commit failed on ov-no3.ariadne-t.local. Please check log file for details.
[2020-09-01 11:18:32.510224]  : volume geo-replication status : FAILED : Commit 
failed on ov-no2.ariadne-t.local. Please check log file for details.
Commit failed on ov-no3.ariadne-t.local. Please check log file for details.
[2020-09-01 14:24:33.778942]  : volume geo-replication status : FAILED : Commit 
failed on ov-no2.ariadne-t.local. Please check log file for details.
Commit failed on ov-no3.ariadne-t.local. Please check log file for details.




[root@ov-no1 glusterfs]# cat glusterd.log | egrep "( W | E )" |more
[2020-09-01 07:00:31.326169] E [glusterd-op-sm.c:8132:glusterd_op_sm] 
(-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x23a1e) [0x7f23d8ac8a1e]
-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x1c1be) [0x7f23d8ac11be] 
-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x4306f) [0x7f23
d8ae806f] ) 0-management: Unable to get transaction opinfo for transaction ID 
:435d3780-aa0c-4a64-bc28-56ae394159d0
[2020-09-01 08:02:31.551563] E [glusterd-op-sm.c:8132:glusterd_op_sm] 
(-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x23a1e) [0x7f23d8ac8a1e]
-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x1c1be) [0x7f23d8ac11be] 
-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x4306f) [0x7f23
d8ae806f] ) 0-management: Unable to get transaction opinfo for transaction ID 
:930a8a08-1044-41cf-b921-913b982e0c72
[2020-09-01 09:04:31.786157] E 

[ovirt-users] Re: Storage Domain won't activate

2020-09-04 Thread Strahil Nikolov via Users
Is this an HCI setup?
If yes, check the Gluster status (I prefer the CLI, but the UI is also valid).

gluster pool list
gluster volume status

gluster volume heal  info summary
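
For example, on a typical oVirt HCI deployment the volumes are usually named
engine, data and vmstore (as in the brick logs later in this thread), so the
heal check would look like:

    gluster volume heal engine info summary
    gluster volume heal data info summary
    gluster volume heal vmstore info summary

All bricks should show as online in "gluster volume status", and the summaries
should report zero entries pending heal or in split-brain.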

Best Regards,
Strahil Nikolov






On Friday, 4 September 2020, 00:38:13 GMT+3, Gillingham, Eric J (US 393D) 
via Users  wrote: 





I recently removed a host from my cluster to upgrade it to 4.4. After I removed 
the host from the datacenter, VMs started to pause on the second system they had 
all migrated to. Investigating via the engine showed the storage domain 
as "unknown"; when I try to activate it via the engine it cycles to locked and 
then back to unknown.

/var/log/sanlock.log contains a repeating:
add_lockspace 
e1270474-108c-4cae-83d6-51698cffebbf:1:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
 conflicts with name of list1 s1 
e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0


vdsm.log contains these (maybe related) snippets:
---
2020-09-03 20:19:53,483+ INFO  (jsonrpc/6) [vdsm.api] FINISH 
getAllTasksStatuses error=Secured object is not in safe state 
from=:::137.79.52.43,36326, flow_id=18031a91, 
task_id=8e92f059-743a-48c8-aa9d-e7c4c836337b (api:52)
2020-09-03 20:19:53,483+ ERROR (jsonrpc/6) [storage.TaskManager.Task] 
(Task='8e92f059-743a-48c8-aa9d-e7c4c836337b') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in 
_run
    return fn(*args, **kargs)
  File "", line 2, in getAllTasksStatuses
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2201, in 
getAllTasksStatuses
    allTasksStatus = self._pool.getAllTasksStatuses()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 77, 
in wrapper
    raise SecureError("Secured object is not in safe state")
SecureError: Secured object is not in safe state
2020-09-03 20:19:53,483+ INFO  (jsonrpc/6) [storage.TaskManager.Task] 
(Task='8e92f059-743a-48c8-aa9d-e7c4c836337b') aborting: Task is aborted: 
u'Secured object is not in safe state' - code 100 (task:1181)
2020-09-03 20:19:53,483+ ERROR (jsonrpc/6) [storage.Dispatcher] FINISH 
getAllTasksStatuses error=Secured object is not in safe state (dispatcher:87)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, 
in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in 
wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in 
prepare
    raise self.error
SecureError: Secured object is not in safe state
---
2020-09-03 20:44:23,252+ INFO  (tasks/2) [storage.ThreadPool.WorkerThread] 
START task 76415a77-9d29-4b72-ade1-53207cfc503b (cmd=>, args=None) (thre
adPool:208)
2020-09-03 20:44:23,266+ INFO  (tasks/2) [storage.SANLock] Acquiring host 
id for domain e1270474-108c-4cae-83d6-51698cffebbf (id=1, wait=True) 
(clusterlock:313)
2020-09-03 20:44:23,267+ ERROR (tasks/2) [storage.TaskManager.Task] 
(Task='76415a77-9d29-4b72-ade1-53207cfc503b') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in 
_run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 317, in 
startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 957, in 
acquireHostId
    self._manifest.acquireHostId(hostId, wait)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 501, in 
acquireHostId
    self._domainLock.acquireHostId(hostId, wait)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 
344, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: 
('e1270474-108c-4cae-83d6-51698cffebbf', SanlockException(22, 'Sanlock 
lockspace add failure', 'Invalid argument'))
---

Another symptom: in the hosts view of the engine, SPM bounces between "Normal" 
and "Contending". When it's Normal, if I select Management -> Select as SPM I 
get "Error while executing action: Cannot force select SPM. Unknown Data Center 
status."

I've tried rebooting the one remaining host in the cluster to no avail; 
hosted-engine --reinitialize-lockspace also does not seem to solve the issue.


I'm kind of stumped as to what else to try, would appreciate any guidance on 
how to resolve this.

Thank You


[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Michael Jones
Also use multiple T4s, also P4s and Titans; no issues, but never used NVLink.

On Fri, 4 Sep 2020, 16:02 Arman Khalatyan,  wrote:

> hi,
> with the 2xT4 we haven't seen any trouble. we have no nvlink there.
>
> did u try to disable the nvlink?
>
>
>
> Vinícius Ferrão via Users  wrote on Fri., 4 Sept.
> 2020, 08:39:
>
>> Hello, here we go again.
>>
>> I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a
>> single VM; but things aren’t that good. Only one GPU shows up on the VM.
>> lspci is able to show the GPUs, but three of them are unusable:
>>
>> 08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>> (rev a1)
>> 09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>> (rev a1)
>> 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>> (rev a1)
>> 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>> (rev a1)
>>
>> There are some errors on dmesg, regarding a misconfigured BIOS:
>>
>> [   27.295972] nvidia: loading out-of-tree module taints kernel.
>> [   27.295980] nvidia: module license 'NVIDIA' taints kernel.
>> [   27.295981] Disabling lock debugging due to kernel taint
>> [   27.304180] nvidia: module verification failed: signature and/or
>> required key missing - tainting kernel
>> [   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major
>> device number 241
>> [   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
>> [   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device
>> is invalid:
>>NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
>> [   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
>> [   27.579566] nvidia: probe of :09:00.0 failed with error -1
>> [   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device
>> is invalid:
>>NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
>> [   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
>> [   27.580734] nvidia: probe of :0a:00.0 failed with error -1
>> [   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device
>> is invalid:
>>NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
>> [   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
>> [   27.581305] nvidia: probe of :0b:00.0 failed with error -1
>> [   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
>> [   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.51.06
>> Sun Jul 19 20:02:54 UTC 2020
>> [   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver
>> for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020
>>
>> The host is Secure Intel Skylake (x86_64). VM is running with Q35 Chipset
>> with UEFI (pc-q35-rhel8.2.0)
>>
>> I’ve tried to change the I/O mapping options on the host, tried with 56TB
>> and 12TB without success. Same results. Didn’t try with 512GB since the
>> machine has 768GB of system RAM.
>>
>> Tried blacklisting the nouveau on the host, nothing.
>> Installed NVIDIA drivers on the host, nothing.
>>
>> On the host I can use the 4x V100, but inside a single VM it’s impossible.
>>
>> Any suggestions?
>>
>>
>>
>> ___
>> Users mailing list -- users@ovirt.org
>> To unsubscribe send an email to users-le...@ovirt.org
>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>> oVirt Code of Conduct:
>> https://www.ovirt.org/community/about/community-guidelines/
>> List Archives:
>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/
>>
> ___
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/PIO4DIVUU4JWG5FXYW3NQSVXCFZWYV26/
>
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZOMK6ULEK3IXNC3TQV5TYIY5SH23NNA4/


[ovirt-users] Re: VM HostedEngine is down with error

2020-09-04 Thread souvaliotimaria
Hello, 

This is what I could gather from the gluster logs around the time frame of the 
HE shutdown.

NODE1:
[root@ov-no1 glusterfs]# more 
bricks/gluster_bricks-vmstore-vmstore.log-20200830 |egrep "( W | E )"|more
[2020-08-27 15:35:03.090477] W [glusterfsd.c:1570:cleanup_and_exit] 
(-->/lib64/libpthread.so.0(+0x7dd5) [0x7fa6e04a3dd5] 
-->/usr/sbin/glusterfsd(glus
terfs_sigwaiter+0xe5) [0x55a40138d1b5] 
-->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x55a40138d01b] ) 0-: received 
signum (15), shutting down
[2020-08-27 15:35:14.926794] E [MSGID: 100018] 
[glusterfsd.c:2333:glusterfs_pidfile_update] 0-glusterfsd: pidfile 
/var/run/gluster/vols/vmstore/ov-no
1.ariadne-t.local-gluster_bricks-vmstore-vmstore.pid lock failed [Resource 
temporarily unavailable]


[root@ov-no1 glusterfs]# more bricks/gluster_bricks-data-data.log-20200830 
|egrep "( W | E )"|more
[2020-08-27 15:35:01.087875] W [glusterfsd.c:1570:cleanup_and_exit] 
(-->/lib64/libpthread.so.0(+0x7dd5) [0x7fc3cbf69dd5] 
-->/usr/sbin/glusterfsd(glus
terfs_sigwaiter+0xe5) [0x555e313711b5] 
-->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x555e3137101b] ) 0-: received 
signum (15), shutting down
[2020-08-27 15:35:14.890471] E [MSGID: 100018] 
[glusterfsd.c:2333:glusterfs_pidfile_update] 0-glusterfsd: pidfile 
/var/run/gluster/vols/data/ov-no1.a
riadne-t.local-gluster_bricks-data-data.pid lock failed [Resource temporarily 
unavailable]


[root@ov-no1 glusterfs]# more bricks/gluster_bricks-engine-engine.log-20200830 
|egrep "( W | E )"|more
[2020-08-27 15:35:02.088732] W [glusterfsd.c:1570:cleanup_and_exit] 
(-->/lib64/libpthread.so.0(+0x7dd5) [0x7f70b99cbdd5] 
-->/usr/sbin/glusterfsd(glus
terfs_sigwaiter+0xe5) [0x55ebd132b1b5] 
-->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x55ebd132b01b] ) 0-: received 
signum (15), shutting down
[2020-08-27 15:35:14.907603] E [MSGID: 100018] 
[glusterfsd.c:2333:glusterfs_pidfile_update] 0-glusterfsd: pidfile 
/var/run/gluster/vols/engine/ov-no1
.ariadne-t.local-gluster_bricks-engine-engine.pid lock failed [Resource 
temporarily unavailable]


[root@ov-no1 glusterfs]# more bricks/gluster_bricks-vmstore-vmstore.log |egrep 
"( W | E )"|more
[nothing in the output]

[root@ov-no1 glusterfs]# more bricks/gluster_bricks-data-data.log |egrep "( W | 
E )"|more
[nothing in the output]

[root@ov-no1 glusterfs]# more bricks/gluster_bricks-engine-engine.log |egrep "( 
W | E )"|more
[nothing in the output]


[root@ov-no1 glusterfs]# more cmd_history.log | egrep "(WARN|error|fail)" |more
[2020-09-01 02:00:38.685251]  : volume geo-replication status : FAILED : Commit 
failed on ov-no2.ariadne-t.local. Please check log file for details.
Commit failed on ov-no3.ariadne-t.local. Please check log file for details.
[2020-09-01 03:02:39.094984]  : volume geo-replication status : FAILED : Commit 
failed on ov-no2.ariadne-t.local. Please check log file for details.
Commit failed on ov-no3.ariadne-t.local. Please check log file for details.
[2020-09-01 11:18:32.510224]  : volume geo-replication status : FAILED : Commit 
failed on ov-no2.ariadne-t.local. Please check log file for details.
Commit failed on ov-no3.ariadne-t.local. Please check log file for details.
[2020-09-01 14:24:33.778942]  : volume geo-replication status : FAILED : Commit 
failed on ov-no2.ariadne-t.local. Please check log file for details.
Commit failed on ov-no3.ariadne-t.local. Please check log file for details.




[root@ov-no1 glusterfs]# cat glusterd.log | egrep "( W | E )" |more
[2020-09-01 07:00:31.326169] E [glusterd-op-sm.c:8132:glusterd_op_sm] 
(-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x23a1e) [0x7f23d8ac8a1e]
 -->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x1c1be) [0x7f23d8ac11be] 
-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x4306f) [0x7f23
d8ae806f] ) 0-management: Unable to get transaction opinfo for transaction ID 
:435d3780-aa0c-4a64-bc28-56ae394159d0
[2020-09-01 08:02:31.551563] E [glusterd-op-sm.c:8132:glusterd_op_sm] 
(-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x23a1e) [0x7f23d8ac8a1e]
 -->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x1c1be) [0x7f23d8ac11be] 
-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x4306f) [0x7f23
d8ae806f] ) 0-management: Unable to get transaction opinfo for transaction ID 
:930a8a08-1044-41cf-b921-913b982e0c72
[2020-09-01 09:04:31.786157] E [glusterd-op-sm.c:8132:glusterd_op_sm] 
(-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x23a1e) [0x7f23d8ac8a1e]
 -->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x1c1be) [0x7f23d8ac11be] 
-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x4306f) [0x7f23
d8ae806f] ) 0-management: Unable to get transaction opinfo for transaction ID 
:9942b579-5240-4fee-bb4c-78b9a1c98da8
[2020-09-01 10:06:32.014362] E [glusterd-op-sm.c:8132:glusterd_op_sm] 
(-->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x23a1e) [0x7f23d8ac8a1e]
 -->/usr/lib64/glusterfs/6.4/xlator/mgmt/glusterd.so(+0x1c1be) [0x7f23d8ac11be] 

[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Arman Khalatyan
hi,
with the 2x T4 we haven't seen any trouble. We have no NVLink there.

Did you try to disable the NVLink?
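
(For reference, the NVLink state can be inspected from the host with the stock
NVIDIA tooling; a sketch, assuming the driver is already installed there:

    # show NVLink link status for every GPU on the host
    nvidia-smi nvlink --status
)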



Vinícius Ferrão via Users  wrote on Fri., 4 Sept. 2020,
08:39:

> Hello, here we go again.
>
> I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a
> single VM; but things aren’t that good. Only one GPU shows up on the VM.
> lspci is able to show the GPUs, but three of them are unusable:
>
> 08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
> (rev a1)
> 09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
> (rev a1)
> 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
> (rev a1)
> 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
> (rev a1)
>
> There are some errors on dmesg, regarding a misconfigured BIOS:
>
> [   27.295972] nvidia: loading out-of-tree module taints kernel.
> [   27.295980] nvidia: module license 'NVIDIA' taints kernel.
> [   27.295981] Disabling lock debugging due to kernel taint
> [   27.304180] nvidia: module verification failed: signature and/or
> required key missing - tainting kernel
> [   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major
> device number 241
> [   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
> [   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device is
> invalid:
>NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
> [   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
> [   27.579566] nvidia: probe of :09:00.0 failed with error -1
> [   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device is
> invalid:
>NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
> [   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
> [   27.580734] nvidia: probe of :0a:00.0 failed with error -1
> [   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device is
> invalid:
>NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
> [   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
> [   27.581305] nvidia: probe of :0b:00.0 failed with error -1
> [   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
> [   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.51.06
> Sun Jul 19 20:02:54 UTC 2020
> [   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver
> for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020
>
> The host is Secure Intel Skylake (x86_64). VM is running with Q35 Chipset
> with UEFI (pc-q35-rhel8.2.0)
>
> I’ve tried to change the I/O mapping options on the host, tried with 56TB
> and 12TB without success. Same results. Didn’t try with 512GB since the
> machine has 768GB of system RAM.
>
> Tried blacklisting the nouveau on the host, nothing.
> Installed NVIDIA drivers on the host, nothing.
>
> On the host I can use the 4x V100, but inside a single VM it’s impossible.
>
> Any suggestions?
>
>
>
> ___
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/
>
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/PIO4DIVUU4JWG5FXYW3NQSVXCFZWYV26/


[ovirt-users] Re: [EXTERNAL] Re: Storage Domain won't activate

2020-09-04 Thread Gillingham, Eric J (US 393D) via Users
On 9/4/20, 4:50 AM, "Vojtech Juranek"  wrote:

On Thursday 3 September 2020 22:49:17 CEST Gillingham, Eric J (US 393D) via 
Users 
wrote:
> I recently removed a host from my cluster to upgrade it to 4.4, after I
> removed the host from the datacenter VMs started to pause on the second
> system they all migrated to. Investigating via the engine showed the
> storage domain was showing as "unknown", when I try to activate it via the
> engine it cycles to locked then to unknown again.

> /var/log/sanlock.log contains a repeating:
> add_lockspace
> e1270474-108c-4cae-83d6-51698cffebbf:1:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
> conflicts with name of list1 s1
> e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0

how did you remove the first host, did you put it into maintenance first? I 
wonder how this situation (two lockspaces with conflicting names) can occur.

You can try to re-initialize the lockspace directly using the sanlock command (see 
man sanlock), but it would be good to understand the situation first.


Just as you said, put into maintenance mode, shut it down, removed it via the 
engine UI.



___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/CDLMSS336Z46BNZ4K4IAWO6JBYAHAFDO/


[ovirt-users] Re: Storage Domain won't activate

2020-09-04 Thread Vojtech Juranek
On Thursday 3 September 2020 22:49:17 CEST Gillingham, Eric J (US 393D) via Users 
wrote:
> I recently removed a host from my cluster to upgrade it to 4.4, after I
> removed the host from the datacenter VMs started to pause on the second
> system they all migrated to. Investigating via the engine showed the
> storage domain was showing as "unknown", when I try to activate it via the
> engine it cycles to locked then to unknown again.
 
> /var/log/sanlock.log contains a repeating:
> add_lockspace
> e1270474-108c-4cae-83d6-51698cffebbf:1:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
> conflicts with name of list1 s1
> e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0

How did you remove the first host, did you put it into maintenance first? I 
wonder how this situation (two lockspaces with conflicting names) can occur.

You can try to re-initialize the lockspace directly using the sanlock command (see 
man sanlock), but it would be good to understand the situation first.
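
Roughly along these lines (a sketch only; double-check against man sanlock and
make sure no host is still using the domain, since re-initializing wipes the
existing delta leases on the ids volume):

    # lockspace_name:host_id:path:offset, with host_id 0 when initializing
    sanlock direct init -s e1270474-108c-4cae-83d6-51698cffebbf:0:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0

oVirt also wraps this for the hosted-engine lockspace specifically via
hosted-engine --reinitialize-lockspace, as mentioned in the quoted message below.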


> 
> vdsm.log contains these (maybe related) snippets:
> ---
> 2020-09-03 20:19:53,483+ INFO  (jsonrpc/6) [vdsm.api] FINISH getAllTasksStatuses error=Secured object is not in safe state from=:::137.79.52.43,36326, flow_id=18031a91, task_id=8e92f059-743a-48c8-aa9d-e7c4c836337b (api:52)
> 2020-09-03 20:19:53,483+ ERROR (jsonrpc/6) [storage.TaskManager.Task] (Task='8e92f059-743a-48c8-aa9d-e7c4c836337b') Unexpected error (task:875)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
>     return fn(*args, **kargs)
>   File "", line 2, in getAllTasksStatuses
>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
>     ret = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2201, in getAllTasksStatuses
>     allTasksStatus = self._pool.getAllTasksStatuses()
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 77, in wrapper
>     raise SecureError("Secured object is not in safe state")
> SecureError: Secured object is not in safe state
> 2020-09-03 20:19:53,483+ INFO  (jsonrpc/6) [storage.TaskManager.Task] (Task='8e92f059-743a-48c8-aa9d-e7c4c836337b') aborting: Task is aborted: u'Secured object is not in safe state' - code 100 (task:1181)
> 2020-09-03 20:19:53,483+ ERROR (jsonrpc/6) [storage.Dispatcher] FINISH getAllTasksStatuses error=Secured object is not in safe state (dispatcher:87)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
>     result = ctask.prepare(func, *args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
>     return m(self, *a, **kw)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in prepare
>     raise self.error
> SecureError: Secured object is not in safe state
> ---
> 2020-09-03 20:44:23,252+ INFO  (tasks/2) [storage.ThreadPool.WorkerThread] START task 76415a77-9d29-4b72-ade1-53207cfc503b (cmd= >, args=None) (threadPool:208)
> 2020-09-03 20:44:23,266+ INFO  (tasks/2) [storage.SANLock] Acquiring host id for domain e1270474-108c-4cae-83d6-51698cffebbf (id=1, wait=True) (clusterlock:313)
> 2020-09-03 20:44:23,267+ ERROR (tasks/2) [storage.TaskManager.Task] (Task='76415a77-9d29-4b72-ade1-53207cfc503b') Unexpected error (task:875)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
>     return fn(*args, **kargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
>     return self.cmd(*self.argslist, **self.argsdict)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 317, in startSpm
>     self.masterDomain.acquireHostId(self.id)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 957, in acquireHostId
>     self._manifest.acquireHostId(hostId, wait)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 501, in acquireHostId
>     self._domainLock.acquireHostId(hostId, wait)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 344, in acquireHostId
>     raise se.AcquireHostIdFailure(self._sdUUID, e)
> AcquireHostIdFailure: Cannot acquire host id: ('e1270474-108c-4cae-83d6-51698cffebbf', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
> ---
> 
> Another symptom: in the hosts view of the engine, SPM bounces between
> "Normal" and "Contending". When it's Normal, if I select Management ->
> Select as SPM I get "Error while executing action: Cannot force select SPM.
> Unknown Data Center status."
 
> I've tried rebooting the one remaining host in the cluster to no avail;
> hosted-engine --reinitialize-lockspace also does not seem to solve the issue.
 
> 
> I'm kind of stumped as to what else to try and would appreciate any guidance
> on how to resolve this.
 
> Thank 

[ovirt-users] Re: Failed to connect to server (code: 1006) connecting to second host noVNC console

2020-09-04 Thread James Loker-Steele via Users
OK, I have resolved this issue.

On the cluster, I had to disable encrypted VNC and reinstall the host to apply
the changes.

This would probably not happen if it was a shared cluster; right now it's two
local clusters.
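
In case someone wants to script this instead of clicking through the UI: if I
remember correctly the setting is exposed as vnc_encryption on the cluster in
the REST API (4.3 and later), so something along these lines should work.
Treat the attribute name, URL and credentials as assumptions and verify them
against your engine's API documentation:

  # hypothetical sketch: turn off VNC encryption for one cluster, then
  # reinstall the hosts so the console configuration is regenerated
  curl -k -u admin@internal:PASSWORD \
       -H "Content-Type: application/xml" \
       -X PUT \
       -d '<cluster><vnc_encryption>false</vnc_encryption></cluster>' \
       https://engine.example.com/ovirt-engine/api/clusters/CLUSTER_ID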


[ovirt-users] Re: Hosted engine install failure: ipv6.gateway: gateway cannot be set if there are no addresses configured

2020-09-04 Thread Dominik Holler
Sverker, is this bug blocking you, or can you work around it?
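
(For anyone finding this in the archives: a possible manual workaround,
assuming the failure really is NetworkManager/nmcli refusing ipv6.gateway on a
profile that has no static IPv6 address, would be to give the profile an
address before the gateway, or to drop the gateway when IPv6 is not needed. A
rough, untested sketch follows, with the interface name and addresses taken
from the dump further down in this thread.)

  # either: set a static IPv6 address together with the gateway
  nmcli connection modify enp4s0 ipv6.method manual \
      ipv6.addresses 2a01:4f8:192:1148::2/64 ipv6.gateway fe80::1

  # or: clear the gateway and disable IPv6 on the profile
  nmcli connection modify enp4s0 ipv6.gateway "" ipv6.method disabled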

On Thu, Sep 3, 2020 at 8:52 PM Dominik Holler  wrote:

> Sverker, thanks!
>
> On Thu, Sep 3, 2020 at 6:50 PM Sverker Abrahamsson <
> sver...@abrahamsson.com> wrote:
>
>> Hi Dominik,
>> bug filed at https://bugzilla.redhat.com/show_bug.cgi?id=1875520. I'm
>> doing a new install to get fresh vdsm and supervdsm logs which will be
>> attached as soon as they've failed.
>> /Sverker
>> On 2020-09-03 at 18:03, Dominik Holler wrote:
>>
>>
>>
>> On Thu, Sep 3, 2020 at 12:42 PM Sverker Abrahamsson <
>> sver...@abrahamsson.com> wrote:
>>
>>> Hi Ales,
>>> this is CentOS 8, so my impression was that you always have
>>> NetworkManager then? At least my attempt to remove it failed miserably.
>>>
>>
>> Yes, on CentOS 8 hosts oVirt requires the interfaces to be managed by
>> NetworkManager.
>>
>>
>>> The enp4s0 config was created by the install, so it should be controlled
>>> by NetworkManager.
>>>
>>
>> This should work. Can you please report a bug on vdsm [1]?
>> It would be helpful if vdsm.log and supervdsm.log were attached to
>> this bug.
>>
>> [1]
>>   https://bugzilla.redhat.com/enter_bug.cgi?product=vdsm
>>
>>
>>
>>> /Sverker
>>> On 2020-09-03 at 12:29, Ales Musil wrote:
>>>
>>>
>>>
>>> On Thu, Sep 3, 2020 at 12:21 PM Sverker Abrahamsson <
>>> sver...@abrahamsson.com> wrote:
>>>
 Hi Ales,
 right now I have a manually created ovirtmgmt bridge (virbr0 and vnet0
 seem to have been created during the failed attempt to deploy the hosted engine):

 [root@h1-mgmt ~]# nmcli con show
 NAME              UUID                                  TYPE      DEVICE
 enp4s0            af7ccb53-011b-4c36-998a-1878b4ae7100  ethernet  enp4s0
 Bridge ovirtmgmt  9a0b07c0-2983-fe97-ec7f-ad2b51c3a3f0  bridge    ovirtmgmt
 virbr0            aa593151-2c12-4cf7-985b-f105b3575d09  bridge    virbr0
 enp4s0.4000       ecc8064d-18c1-99b7-3fe4-9c5a593ece6f  vlan      enp4s0.4000
 vnet0             a6db45bd-93c8-4c37-85fc-0c58ba3e9d00  tun       vnet0
 [root@h1-mgmt ~]# nmstatectl show
 ---
 dns-resolver:
   config:
     search: []
     server:
     - 213.133.98.98
   running:
     search: []
     server:
     - 213.133.98.98
 route-rules:
   config: []
 routes:
   config:
   - destination: 0.0.0.0/0
     metric: -1
     next-hop-address: 144.76.84.65
     next-hop-interface: enp4s0
     table-id: 0
   - destination: ::/0
     metric: -1
     next-hop-address: fe80::1
     next-hop-interface: enp4s0
     table-id: 0
   running:
   - destination: 0.0.0.0/0
     metric: 100
     next-hop-address: 144.76.84.65
     next-hop-interface: enp4s0
     table-id: 254
   - destination: 144.76.84.65/32
     metric: 100
     next-hop-address: ''
     next-hop-interface: enp4s0
     table-id: 254
   - destination: 172.27.1.0/24
     metric: 425
     next-hop-address: ''
     next-hop-interface: ovirtmgmt
     table-id: 254
   - destination: 192.168.1.0/24
     metric: 0
     next-hop-address: ''
     next-hop-interface: virbr0
     table-id: 254
   - destination: 2a01:4f8:192:1148::/64
     metric: 100
     next-hop-address: ''
     next-hop-interface: enp4s0
     table-id: 254
   - destination: ::/0
     metric: 100
     next-hop-address: fe80::1
     next-hop-interface: enp4s0
     table-id: 254
   - destination: fe80::/64
     metric: 100
     next-hop-address: ''
     next-hop-interface: enp4s0
     table-id: 254
   - destination: ff00::/8
     metric: 256
     next-hop-address: ''
     next-hop-interface: enp4s0
     table-id: 255
 interfaces:
 - name: ;vdsmdummy;
   type: linux-bridge
   state: down
   ipv4:
     enabled: false
   ipv6:
     enabled: false
   mac-address: DE:D3:A8:24:27:F6
   mtu: 1500
 - name: br-int
   type: unknown
   state: down
   ipv4:
     enabled: false
   ipv6:
     enabled: false
   mac-address: 6E:37:94:63:E0:4B
   mtu: 1500
 - name: enp4s0
   type: ethernet
   state: up
   ethernet:
     auto-negotiation: true
     duplex: full
     speed: 1000
   ipv4:
     address:
     - ip: 144.76.84.73
       prefix-length: 32
     dhcp: false
     enabled: true
   ipv6:
     address:
     - ip: 2a01:4f8:192:1148::2
       prefix-length: 64
     - ip: fe80::62a4:4cff:fee9:4ac
       prefix-length: 64
     auto-dns: true
     auto-gateway: true
     auto-routes: true
     autoconf: true
     dhcp: true
     enabled: true
   mac-address: 60:A4:4C:E9:04:AC
   mtu: 1500
 - name: enp4s0.4000
   type: vlan
   state: up
   ipv4:
     dhcp: false
     enabled: false
   ipv6:
 

[ovirt-users] Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Vinícius Ferrão via Users
Hello, here we go again.

I’m trying to pass through 4x NVIDIA Tesla V100 GPUs (with NVLink) to a single
VM, but things aren’t that good. Only one GPU shows up in the VM. lspci is able
to show the GPUs, but three of them are unusable:

08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)

There are some errors in dmesg regarding a misconfigured BIOS:

[   27.295972] nvidia: loading out-of-tree module taints kernel.
[   27.295980] nvidia: module license 'NVIDIA' taints kernel.
[   27.295981] Disabling lock debugging due to kernel taint
[   27.304180] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
[   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
   NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
[   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
[   27.579566] nvidia: probe of :09:00.0 failed with error -1
[   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
   NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
[   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
[   27.580734] nvidia: probe of :0a:00.0 failed with error -1
[   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
   NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
[   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
[   27.581305] nvidia: probe of :0b:00.0 failed with error -1
[   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
[   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.51.06  Sun Jul 19 20:02:54 UTC 2020
[   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020

The host CPU type is Secure Intel Skylake (x86_64). The VM is running with the
Q35 chipset and UEFI (pc-q35-rhel8.2.0).

I’ve tried changing the I/O mapping options on the host, with 56TB and 12TB,
without success; same results. I didn’t try 512GB since the machine has 768GB
of system RAM.

Tried blacklisting nouveau on the host: nothing.
Installed the NVIDIA drivers on the host: nothing.

On the host I can use all four V100s, but inside a single VM it’s impossible.

Any suggestions?
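
Not an authoritative answer, but one thing worth checking: with four 16 GB
cards the guest needs a much larger 64-bit PCI MMIO window than OVMF reserves
by default, and "BAR is 0M @ 0x0" is what the NVIDIA driver prints when a BAR
was never mapped. A rough sketch of how that window could be enlarged through
the vdsm qemucmdline hook; the package name, custom property format and fw_cfg
knob are assumptions to verify against the hook's documentation, not a tested
recipe:

  # on each host: install the hook that lets a VM pass extra QEMU arguments
  yum install vdsm-hook-qemucmdline
  systemctl restart vdsmd

  # then set the VM custom property (engine UI -> VM -> Custom Properties) to
  # ask OVMF for a 64 GiB 64-bit MMIO aperture:
  #   qemu_cmdline=["-fw_cfg", "name=opt/ovmf/X-PciMmio64Mb,string=65536"]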


