[ovirt-users] Re: storage high latency, sanlock errors, cluster instability

2022-06-08 Thread Strahil Nikolov via Users
Usually I would blacklist the Gluster devices by creating the necessary stanzas 
in /etc/multipath/conf.d/blacklist.conf. This way you keep the situation simple.
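A minimal sketch of such a stanza (assuming the identifier lsblk shows for the
Gluster disk later in this thread, XA1920LE10063_HKS028AV, is the WWID - verify
the real value with multipath -ll on your node):

$ cat /etc/multipath/conf.d/blacklist.conf
# keep multipath away from the local Gluster brick disk
blacklist {
    wwid "XA1920LE10063_HKS028AV"
}

Then apply it with "multipathd reconfigure".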

As for your problem, it's hard to identify the cause based on the e-mails. What 
are your symptoms?
To debug GlusterFS, it is good to start from the brick logs 
(/var/log/glusterfs/bricks) and the current heal status. On a 3-way replica 
volume heals should be resolved by GlusterFS automatically - if not, there is a bug.
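For example (a sketch, assuming the volume names that appear later in this
thread, e.g. data and vmstore; brick log file names are derived from the brick
path, so they may differ on your nodes):

# last lines of a brick log
tail -n 50 /var/log/glusterfs/bricks/gluster_bricks-data-data.log
# pending heals and brick status
gluster volume heal data info summary
gluster volume status data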
Best Regards,
Strahil Nikolov
 
 
On Tue, May 31, 2022 at 16:32, jb wrote:

I still have the same problems, but it looks like the errors come a 
bit less often.

I'm starting now to migrate the disk images to NFS storage. If there 
is no other way, I will recreate the GlusterFS cluster.

The problem I have is that I don't know where the root of this 
problem is, and whether recreating would fix the issue in the longer term.

On 29.05.22 at 20:26, Nir Soffer wrote:
...

[ovirt-users] Re: storage high latency, sanlock errors, cluster instability

2022-05-31 Thread jb
I still have the same problems, but it looks like the errors come a 
bit less often.

I'm starting now to migrate the disk images to NFS storage. If there 
is no other way, I will recreate the GlusterFS cluster.

The problem I have is that I don't know where the root of this 
problem is, and whether recreating would fix the issue in the longer term.


On 29.05.22 at 20:26, Nir Soffer wrote:
...

[ovirt-users] Re: storage high latency, sanlock errors, cluster instability

2022-05-29 Thread Jonathan Baecker

On 29.05.22 at 20:26, Nir Soffer wrote:
...

[ovirt-users] Re: storage high latency, sanlock errors, cluster instability

2022-05-29 Thread Nir Soffer
On Sun, May 29, 2022 at 9:03 PM Jonathan Baecker wrote:
> ...
[ovirt-users] Re: storage high latency, sanlock errors, cluster instability

2022-05-29 Thread Jonathan Baecker

On 29.05.22 at 19:24, Nir Soffer wrote:

On Sun, May 29, 2022 at 7:50 PM Jonathan Baecker  wrote:

...

Can you share more info on why it does not work for you?
The problem was that the node could not mount the gluster volumes 
anymore and ended up in emergency mode.

- output of lsblk


NAME                                                       MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                                          8:0    0   1.8T  0 disk
`-XA1920LE10063_HKS028AV                                   253:0    0   1.8T  0 mpath
  |-gluster_vg_sda-gluster_thinpool_gluster_vg_sda_tmeta   253:16   0     9G  0 lvm
  | `-gluster_vg_sda-gluster_thinpool_gluster_vg_sda-tpool 253:18   0   1.7T  0 lvm
  |   |-gluster_vg_sda-gluster_thinpool_gluster_vg_sda     253:19   0   1.7T  1 lvm
  |   |-gluster_vg_sda-gluster_lv_data                     253:20   0   100G  0 lvm   /gluster_bricks/data
  |   `-gluster_vg_sda-gluster_lv_vmstore                  253:21   0   1.6T  0 lvm   /gluster_bricks/vmstore
  `-gluster_vg_sda-gluster_thinpool_gluster_vg_sda_tdata   253:17   0   1.7T  0 lvm
    `-gluster_vg_sda-gluster_thinpool_gluster_vg_sda-tpool 253:18   0   1.7T  0 lvm
      |-gluster_vg_sda-gluster_thinpool_gluster_vg_sda     253:19   0   1.7T  1 lvm
      |-gluster_vg_sda-gluster_lv_data                     253:20   0   100G  0 lvm   /gluster_bricks/data
      `-gluster_vg_sda-gluster_lv_vmstore                  253:21   0   1.6T  0 lvm   /gluster_bricks/vmstore
sr0                                                         11:0    1  1024M  0 rom
nvme0n1                                                    259:0    0 238.5G  0 disk
|-nvme0n1p1                                                259:1    0     1G  0 part  /boot
|-nvme0n1p2                                                259:2    0   134G  0 part
| |-onn-pool00_tmeta                                       253:1    0     1G  0 lvm
| | `-onn-pool00-tpool                                     253:3    0    87G  0 lvm
| |   |-onn-ovirt--node--ng--4.5.0.2--0.20220513.0+1       253:4    0    50G  0 lvm   /
| |   |-onn-pool00                                         253:7    0    87G  1 lvm
| |   |-onn-home                                           253:8    0     1G  0 lvm   /home
| |   |-onn-tmp                                            253:9    0     1G  0 lvm   /tmp
| |   |-onn-var                                            253:10   0    15G  0 lvm   /var
| |   |-onn-var_crash                                      253:11   0    10G  0 lvm   /var/crash
| |   |-onn-var_log                                        253:12   0     8G  0 lvm   /var/log
| |   |-onn-var_log_audit                                  253:13   0     2G  0 lvm   /var/log/audit
| |   |-onn-ovirt--node--ng--4.5.0.1--0.20220511.0+1       253:14   0    50G  0 lvm
| |   `-onn-var_tmp                                        253:15   0    10G  0 lvm   /var/tmp
| |-onn-pool00_tdata                                       253:2    0    87G  0 lvm
| | `-onn-pool00-tpool                                     253:3    0    87G  0 lvm
| |   |-onn-ovirt--node--ng--4.5.0.2--0.20220513.0+1       253:4    0    50G  0 lvm   /
| |   |-onn-pool00                                         253:7    0    87G  1 lvm
| |   |-onn-home                                           253:8    0     1G  0 lvm   /home
| |   |-onn-tmp                                            253:9    0     1G  0 lvm   /tmp
| |   |-onn-var                                            253:10   0    15G  0 lvm   /var
| |   |-onn-var_crash                                      253:11   0    10G  0 lvm   /var/crash
| |   |-onn-var_log                                        253:12   0     8G  0 lvm   /var/log
| |   |-onn-var_log_audit                                  253:13   0     2G  0 lvm   /var/log/audit
| |   |-onn-ovirt--node--ng--4.5.0.1--0.20220511.0+1       253:14   0    50G  0 lvm
| |   `-onn-var_tmp                                        253:15   0    10G  0 lvm   /var/tmp
| `-onn-swap                                               253:5    0    20G  0 lvm   [SWAP]
`-nvme0n1p3                                                259:3    0    95G  0 part
  `-gluster_vg_nvme0n1p3-gluster_lv_engine                 253:6    0    94G  0 lvm   /gluster_bricks/engine


- The old lvm filter used, and why it was needed


filter = ["a|^/dev/disk/by-id/lvm-pv-uuid-Nn7tZl-TFdY-BujO-VZG5-EaGW-5YFd-Lo5pwa$|",
          "a|^/dev/disk/by-id/lvm-pv-uuid-Wcbxnx-2RhC-s1Re-s148-nLj9-Tr3f-jj4VvE$|",
          "a|^/dev/disk/by-id/lvm-pv-uuid-lX51wm-H7V4-3CTn-qYob-Rkpx-Tptd-t94jNL$|",
          "r|.*|"]

I don't remember exactly anymore why it was needed, but without it the 
node was not working correctly. I think I even used vdsm-tool 
config-lvm-filter.
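A sketch for checking what those lvm-pv-uuid links resolve to, and which PVs
lvm currently sees:

ls -l /dev/disk/by-id/lvm-pv-uuid-*
pvs -o pv_name,pv_uuid,vg_name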

- output of vdsm-tool config-lvm-filter


   Analyzing host...
   Found these mounted logical volumes on this host:

  logical volume: /dev/mapper/gluster_vg_nvme0n1p3-gluster_lv_engine
  mountpoint:  /gluster_bricks/engine
  devices: /dev/nvme0n1p3

  logical volume:  /dev/mapper/gluster_vg_sda-gluster_lv_data
  mountpoint:  /gluster_bricks/data
  devices: /dev/mapper/XA1920LE10063_HKS028AV

  logical volume:  /dev/mapper/gluster_vg_sda-gluster_

[ovirt-users] Re: storage high latency, sanlock errors, cluster instability

2022-05-29 Thread Nir Soffer
On Sun, May 29, 2022 at 7:50 PM Jonathan Baecker  wrote:
>
> Hello everybody,
>
> we run a 3 node self hosted cluster with GlusterFS. I had a lot of problems 
> upgrading ovirt from 4.4.10 to 4.5.0.2 and now we have cluster instability.
>
> First I will write down the problems I had with upgrading, so you get a 
> bigger picture:
>
> engine update went fine
> But the nodes I could not update because of a wrong imgbase version, so I did 
> a manual update to 4.5.0.1 and later to 4.5.0.2. The first time after updating 
> it was still booting into 4.4.10, so I did a reinstall.
> Then after the second reboot I ended up in emergency mode. After a long 
> search I figured out that lvm.conf now uses use_devicesfile, but there it 
> uses the wrong filters. So I commented this out and added the old filters 
> back. This procedure I have done on all 3 nodes.

When use_devicesfile (default in 4.5) is enabled, the lvm filter is not used. 
During installation the old lvm filter is removed.

Can you share more info on why it does not work for you?
- output of lsblk
- The old lvm filter used, and why it was needed
- output of vdsm-tool config-lvm-filter

If using lvm devices does not work for you, you can enable the lvm filter 
in vdsm configuration by adding a drop-in file:

$ cat /etc/vdsm/vdsm.conf.d/99-local.conf
[lvm]
config_method = filter

And run:

vdsm-tool config-lvm-filter

to configure the lvm filter in the best way for vdsm. If this does not create 
the right filter we would like to know why, but in general you should use 
lvm devices, since it avoids the trouble of maintaining the filter and dealing 
with upgrades and user-edited lvm filters.

If you disable use_devicesfile, the next vdsm upgrade will enable it back 
unless you change the configuration.

Also, even if you disable use_devicesfile in lvm.conf, vdsm still uses 
--devices instead of a filter when running lvm commands, and lvm commands run 
by vdsm ignore your lvm filter since the --devices option overrides the 
system settings.
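If the generated devices file is simply missing the Gluster PVs, another
option is adding them instead of restoring the filter - a sketch, assuming
the device paths from the lsblk output above:

lvmdevices                                          # list current entries
lvmdevices --adddev /dev/mapper/XA1920LE10063_HKS028AV
lvmdevices --adddev /dev/nvme0n1p3
lvmdevices --check                                  # validate the file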

...
> I noticed some unsynced volume warnings, but because I had this in the past 
> too, after upgrading, I thought they would disappear after some time. The 
> next day they were still there, so I decided to put the nodes in maintenance 
> mode again and restart the glusterd service. After some time the sync 
> warnings were gone.

Not clear what these warnings are - I guess Gluster warnings?

> So now the actual problem:
>
> Since this time the cluster is unstable. I get different errors and warnings, 
> like:
>
> VM [name] is not responding
> out of nowhere an HA VM gets migrated
> VM migration can fail
> VM backup with snapshotting and export takes very long

How do you back up the VMs? Do you use a backup application? How is it
configured?

> VMs are getting very slow sometimes
> Storage domain vmstore experienced a high latency of 9.14251
> ovs|1|db_ctl_base|ERR|no key "dpdk-init" in Open_vSwitch record "." 
> column other_config
> 489279 [1064359]: s8 renewal error -202 delta_length 10 last_success 489249
> 444853 [2243175]: s27 delta_renew read timeout 10 sec offset 0 
> /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
> 471099 [2243175]: s27 delta_renew read timeout 10 sec offset 0 
> /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
> many of: 424035 [2243175]: s27 delta_renew long write time XX sec

All these issues tell us that your storage is not working correctly.
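A rough way to see the latency sanlock is experiencing is a timed direct-I/O
read of the same ids file (a sketch - adjust the mount path to your domain):

time dd if=/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids of=/dev/null iflag=direct bs=1M count=1

If this regularly takes more than a few seconds, sanlock renewals will time out.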

sanlock.log is full of renewal errors from May:

$ grep 2022-05- sanlock.log | wc -l
4844

$ grep 2022-05- sanlock.log | grep 'renewal error' | wc -l
631

But there is a lot of trouble from earlier months:

$ grep 2022-04- sanlock.log | wc -l
844
$ grep 2022-04- sanlock.log | grep 'renewal error' | wc -l
29

$ grep 2022-03- sanlock.log | wc -l
1609
$ grep 2022-03- sanlock.log | grep 'renewal error' | wc -l
483

$ grep 2022-02- sanlock.log | wc -l
826
$ grep 2022-02- sanlock.log | grep 'renewal error' | wc -l
242

Here sanlock log looks healthy:

$ grep 2022-01- sanlock.log | wc -l
3
$ grep 2022-01- sanlock.log | grep 'renewal error' | wc -l
0

$ grep 2021-12- sanlock.log | wc -l
48
$ grep 2021-12- sanlock.log | grep 'renewal error' | wc -l
0
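The live lockspace state on a host can also be checked with (sketch):

sanlock client status
sanlock client host_status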

vdsm log shows that 2 domains are not accessible:

$ grep ERROR vdsm.log
2022-05-29 15:07:19,048+0200 ERROR (check/loop) [storage.monitor]
Error checking path
/rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
(monitor:511)
2022-05-29 16:33:59,049+0200 ERROR (check/loop) [storage.monitor]
Error checking path
/rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
(monitor:511)
2022-05-29 16:34:39,049+0200 ERROR (check/loop) [storage.monitor]
Error checking path
/rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
(monitor:511)
2022-05-29 17:21:39,050+02