On Sun, May 29, 2022 at 9:03 PM Jonathan Baecker <jonba...@gmail.com> wrote:
>
> On 29.05.22 at 19:24, Nir Soffer wrote:
>
> On Sun, May 29, 2022 at 7:50 PM Jonathan Baecker <jonba...@gmail.com> wrote:
>
> Hello everybody,
>
> we run a 3-node self-hosted cluster with GlusterFS. I had a lot of problems
> upgrading oVirt from 4.4.10 to 4.5.0.2, and now we have cluster instability.
>
> First I will write down the problems I had with upgrading, so you get a
> bigger picture:
>
> The engine update went fine.
> But I could not update the nodes because of a wrong version of imgbase, so
> I did a manual update to 4.5.0.1 and later to 4.5.0.2. The first time after
> updating, it was still booting into 4.4.10, so I did a reinstall.
> Then after the second reboot I ended up in emergency mode. After a long
> search I figured out that lvm.conf now uses use_devicesfile, but with that
> it uses the wrong filters. So I commented this out and added the old
> filters back. This procedure I did on all 3 nodes.
>
> When use_devicesfile (default in 4.5) is enabled, the lvm filter is not
> used. During installation the old lvm filter is removed.
>
> Can you share more info on why it does not work for you?
>
> The problem was that the node could not mount the gluster volumes anymore
> and ended up in emergency mode.
>
> - output of lsblk
>
> NAME                                                         MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
> sda                                                            8:0    0   1.8T  0 disk
> `-XA1920LE10063_HKS028AV                                     253:0    0   1.8T  0 mpath
>   |-gluster_vg_sda-gluster_thinpool_gluster_vg_sda_tmeta     253:16   0     9G  0 lvm
>   | `-gluster_vg_sda-gluster_thinpool_gluster_vg_sda-tpool   253:18   0   1.7T  0 lvm
>   |   |-gluster_vg_sda-gluster_thinpool_gluster_vg_sda       253:19   0   1.7T  1 lvm
>   |   |-gluster_vg_sda-gluster_lv_data                       253:20   0   100G  0 lvm   /gluster_bricks/data
>   |   `-gluster_vg_sda-gluster_lv_vmstore                    253:21   0   1.6T  0 lvm   /gluster_bricks/vmstore
>   `-gluster_vg_sda-gluster_thinpool_gluster_vg_sda_tdata     253:17   0   1.7T  0 lvm
>     `-gluster_vg_sda-gluster_thinpool_gluster_vg_sda-tpool   253:18   0   1.7T  0 lvm
>       |-gluster_vg_sda-gluster_thinpool_gluster_vg_sda       253:19   0   1.7T  1 lvm
>       |-gluster_vg_sda-gluster_lv_data                       253:20   0   100G  0 lvm   /gluster_bricks/data
>       `-gluster_vg_sda-gluster_lv_vmstore                    253:21   0   1.6T  0 lvm   /gluster_bricks/vmstore
> sr0                                                           11:0    1  1024M  0 rom
> nvme0n1                                                      259:0    0 238.5G  0 disk
> |-nvme0n1p1                                                  259:1    0     1G  0 part  /boot
> |-nvme0n1p2                                                  259:2    0   134G  0 part
> | |-onn-pool00_tmeta                                         253:1    0     1G  0 lvm
> | | `-onn-pool00-tpool                                       253:3    0    87G  0 lvm
> | |   |-onn-ovirt--node--ng--4.5.0.2--0.20220513.0+1         253:4    0    50G  0 lvm   /
> | |   |-onn-pool00                                           253:7    0    87G  1 lvm
> | |   |-onn-home                                             253:8    0     1G  0 lvm   /home
> | |   |-onn-tmp                                              253:9    0     1G  0 lvm   /tmp
> | |   |-onn-var                                              253:10   0    15G  0 lvm   /var
> | |   |-onn-var_crash                                        253:11   0    10G  0 lvm   /var/crash
> | |   |-onn-var_log                                          253:12   0     8G  0 lvm   /var/log
> | |   |-onn-var_log_audit                                    253:13   0     2G  0 lvm   /var/log/audit
> | |   |-onn-ovirt--node--ng--4.5.0.1--0.20220511.0+1         253:14   0    50G  0 lvm
> | |   `-onn-var_tmp                                          253:15   0    10G  0 lvm   /var/tmp
> | |-onn-pool00_tdata                                         253:2    0    87G  0 lvm
> | | `-onn-pool00-tpool                                       253:3    0    87G  0 lvm
> | |   |-onn-ovirt--node--ng--4.5.0.2--0.20220513.0+1         253:4    0    50G  0 lvm   /
> | |   |-onn-pool00                                           253:7    0    87G  1 lvm
> | |   |-onn-home                                             253:8    0     1G  0 lvm   /home
> | |   |-onn-tmp                                              253:9    0     1G  0 lvm   /tmp
> | |   |-onn-var                                              253:10   0    15G  0 lvm   /var
> | |   |-onn-var_crash                                        253:11   0    10G  0 lvm   /var/crash
> | |   |-onn-var_log                                          253:12   0     8G  0 lvm   /var/log
> | |   |-onn-var_log_audit                                    253:13   0     2G  0 lvm   /var/log/audit
> | |   |-onn-ovirt--node--ng--4.5.0.1--0.20220511.0+1         253:14   0    50G  0 lvm
> | |   `-onn-var_tmp                                          253:15   0    10G  0 lvm   /var/tmp
> | `-onn-swap                                                 253:5    0    20G  0 lvm   [SWAP]
> `-nvme0n1p3                                                  259:3    0    95G  0 part
>   `-gluster_vg_nvme0n1p3-gluster_lv_engine                   253:6    0    94G  0 lvm   /gluster_bricks/engine
>
> - The old lvm filter used, and why it was needed
>
> filter = ["a|^/dev/disk/by-id/lvm-pv-uuid-Nn7tZl-TFdY-BujO-VZG5-EaGW-5YFd-Lo5pwa$|",
>           "a|^/dev/disk/by-id/lvm-pv-uuid-Wcbxnx-2RhC-s1Re-s148-nLj9-Tr3f-jj4VvE$|",
>           "a|^/dev/disk/by-id/lvm-pv-uuid-lX51wm-H7V4-3CTn-qYob-Rkpx-Tptd-t94jNL$|",
>           "r|.*|"]
>
> I don't remember exactly anymore why it was needed, but without it the node
> was not working correctly. I think I even used vdsm-tool config-lvm-filter.
I think that if you list the devices in this filter:

    ls -lh /dev/disk/by-id/lvm-pv-uuid-Nn7tZl-TFdY-BujO-VZG5-EaGW-5YFd-Lo5pwa \
        /dev/disk/by-id/lvm-pv-uuid-Wcbxnx-2RhC-s1Re-s148-nLj9-Tr3f-jj4VvE \
        /dev/disk/by-id/lvm-pv-uuid-lX51wm-H7V4-3CTn-qYob-Rkpx-Tptd-t94jNL

you will see that these are the devices used by these vgs:

    gluster_vg_sda, gluster_vg_nvme0n1p3, onn

> - output of vdsm-tool config-lvm-filter
>
> Analyzing host...
> Found these mounted logical volumes on this host:
>
>   logical volume:  /dev/mapper/gluster_vg_nvme0n1p3-gluster_lv_engine
>   mountpoint:      /gluster_bricks/engine
>   devices:         /dev/nvme0n1p3
>
>   logical volume:  /dev/mapper/gluster_vg_sda-gluster_lv_data
>   mountpoint:      /gluster_bricks/data
>   devices:         /dev/mapper/XA1920LE10063_HKS028AV
>
>   logical volume:  /dev/mapper/gluster_vg_sda-gluster_lv_vmstore
>   mountpoint:      /gluster_bricks/vmstore
>   devices:         /dev/mapper/XA1920LE10063_HKS028AV
>
>   logical volume:  /dev/mapper/onn-home
>   mountpoint:      /home
>   devices:         /dev/nvme0n1p2
>
>   logical volume:  /dev/mapper/onn-ovirt--node--ng--4.5.0.2--0.20220513.0+1
>   mountpoint:      /
>   devices:         /dev/nvme0n1p2
>
>   logical volume:  /dev/mapper/onn-swap
>   mountpoint:      [SWAP]
>   devices:         /dev/nvme0n1p2
>
>   logical volume:  /dev/mapper/onn-tmp
>   mountpoint:      /tmp
>   devices:         /dev/nvme0n1p2
>
>   logical volume:  /dev/mapper/onn-var
>   mountpoint:      /var
>   devices:         /dev/nvme0n1p2
>
>   logical volume:  /dev/mapper/onn-var_crash
>   mountpoint:      /var/crash
>   devices:         /dev/nvme0n1p2
>
>   logical volume:  /dev/mapper/onn-var_log
>   mountpoint:      /var/log
>   devices:         /dev/nvme0n1p2
>
>   logical volume:  /dev/mapper/onn-var_log_audit
>   mountpoint:      /var/log/audit
>   devices:         /dev/nvme0n1p2
>
>   logical volume:  /dev/mapper/onn-var_tmp
>   mountpoint:      /var/tmp
>   devices:         /dev/nvme0n1p2
>
> Configuring LVM system.devices.
> Devices for following VGs will be imported:
>
>   gluster_vg_sda, gluster_vg_nvme0n1p3, onn
>
> To properly configure the host, we need to add multipath
> blacklist in /etc/multipath/conf.d/vdsm_blacklist.conf:
>
>   blacklist {
>       wwid "eui.0025388901b1e26f"
>   }
>
> Configure host? [yes,NO]

If you run "vdsm-tool config-lvm-filter" and confirm with "yes", I think all
the vgs will be imported properly into the lvm devices file (a quick way to
verify the result is sketched below, after the quoted text). I don't think
this will solve the storage issues you have had since Feb 2022, but at least
you will have a standard configuration, and the next upgrade will not revert
your local settings.

> If using lvm devices does not work for you, you can enable the lvm filter
> in the vdsm configuration by adding a drop-in file:
>
> $ cat /etc/vdsm/vdsm.conf.d/99-local.conf
> [lvm]
> config_method = filter
>
> And run:
>
>     vdsm-tool config-lvm-filter
>
> to configure the lvm filter in the best way for vdsm. If this does not
> create the right filter, we would like to know why, but in general you
> should use lvm devices, since it avoids the trouble of maintaining the
> filter and dealing with upgrades and user-edited lvm filters.
>
> If you disable use_devicesfile, the next vdsm upgrade will enable it back
> unless you change the configuration.
>
> I would be happy to just use the default, if there were a way to make
> use_devicesfile work.
>
> Also, even if you disable use_devicesfile in lvm.conf, vdsm still uses
> --devices instead of filter when running lvm commands, and lvm commands
> run by vdsm ignore your lvm filter, since the --devices option overrides
> the system settings.
>
> ...
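For example, to verify the result after confirming (a quick sketch, assuming
the lvm2 devices-file tooling shipped with oVirt 4.5 nodes):

    $ lvmdevices
    $ cat /etc/lvm/devices/system.devices

All three PVs from your old filter should be listed there. If one is
missing, it can be imported by hand, e.g.:

    $ lvmdevices --adddev /dev/disk/by-id/lvm-pv-uuid-Nn7tZl-TFdY-BujO-VZG5-EaGW-5YFd-Lo5pwa

and likewise for the other two PVs.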
> I noticed some unsynced volume warnings, but because I had this in the
> past too after upgrading, I thought they would disappear after some time.
> The next day they were still there, so I decided to put the nodes into
> maintenance mode again and restart the glusterd service. After some time
> the sync warnings were gone.
>
> Not clear what these warnings are - I guess Gluster warnings?
>
> Yes, they were Gluster warnings; under Storage -> Volumes it was saying
> that some entries are unsynced.
>
> So now the actual problem:
>
> Since this time the cluster is unstable. I get different errors and
> warnings, like:
>
> VM [name] is not responding
> out of nowhere an HA VM gets migrated
> VM migration can fail
> VM backup with snapshotting and export takes very long
>
> How do you back up the VMs? Do you use a backup application? How is it
> configured?
>
> I use a self-made Python script which uses the REST API. I create a
> snapshot of the VM, build a new VM from that snapshot, and move the new
> one to the export domain.

This is not very efficient - it copies the entire VM at the point in time of
the snapshot, and then copies it again to the export domain. If you use a
backup application supporting the incremental backup API, the first full
backup copies the entire VM once, but later incremental backups copy only
the changes since the last backup (see the sketch below).

> VMs are getting very slow sometimes
> Storage domain vmstore experienced a high latency of 9.14251
> ovs|00001|db_ctl_base|ERR|no key "dpdk-init" in Open_vSwitch record "." column other_config
> 489279 [1064359]: s8 renewal error -202 delta_length 10 last_success 489249
> 444853 [2243175]: s27 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
> 471099 [2243175]: s27 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
> many of: 424035 [2243175]: s27 delta_renew long write time XX sec

All these issues tell us that your storage is not working correctly.
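If you want to experiment with the incremental backup API from your own
script, here is a minimal sketch of starting a backup through the REST API.
The engine URL, VM_ID and DISK_ID are placeholders you must fill in; a
complete example, including downloading the backup data with image
transfers, is the backup_vm.py example shipped with the oVirt Python SDK:

    $ curl -k -u admin@internal:PASSWORD \
          -H "Content-Type: application/xml" \
          -d '<backup><disks><disk id="DISK_ID"/></disks></backup>' \
          https://engine.example.org/ovirt-engine/api/vms/VM_ID/backups

The first run is a full backup. The response includes a checkpoint id;
passing it as <from_checkpoint_id> in the next backup request should make
that backup incremental, copying only the blocks changed since the
checkpoint.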
> sanlock.log is full of renewal errors from May:
>
> $ grep 2022-05- sanlock.log | wc -l
> 4844
>
> $ grep 2022-05- sanlock.log | grep 'renewal error' | wc -l
> 631
>
> But there was a lot of trouble in earlier months too:
>
> $ grep 2022-04- sanlock.log | wc -l
> 844
> $ grep 2022-04- sanlock.log | grep 'renewal error' | wc -l
> 29
>
> $ grep 2022-03- sanlock.log | wc -l
> 1609
> $ grep 2022-03- sanlock.log | grep 'renewal error' | wc -l
> 483
>
> $ grep 2022-02- sanlock.log | wc -l
> 826
> $ grep 2022-02- sanlock.log | grep 'renewal error' | wc -l
> 242
>
> Here the sanlock log looks healthy:
>
> $ grep 2022-01- sanlock.log | wc -l
> 3
> $ grep 2022-01- sanlock.log | grep 'renewal error' | wc -l
> 0
>
> $ grep 2021-12- sanlock.log | wc -l
> 48
> $ grep 2021-12- sanlock.log | grep 'renewal error' | wc -l
> 0
>
> The vdsm log shows that 2 domains are not accessible:
>
> $ grep ERROR vdsm.log
> 2022-05-29 15:07:19,048+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata (monitor:511)
> 2022-05-29 16:33:59,049+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata (monitor:511)
> 2022-05-29 16:34:39,049+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata (monitor:511)
> 2022-05-29 17:21:39,050+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata (monitor:511)
> 2022-05-29 17:55:59,712+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/metadata (monitor:511)
> 2022-05-29 17:56:19,711+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/metadata (monitor:511)
> 2022-05-29 17:56:39,050+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata (monitor:511)
> 2022-05-29 17:56:39,711+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/metadata (monitor:511)
>
> You need to find out what the issue with your Gluster storage is.
>
> I hope that Ritesh can help debug the issue with Gluster.
>
> Nir
>
> I'm worried that I'll do something that makes it even worse, and I have no
> idea what the problem is. To me it does not look exactly like a problem
> with data inconsistencies.

The problem is that your Gluster storage is not healthy, and reading from
and writing to it times out. Checking the volume and heal status on each
node would be a good first step (see the sketch below).

Please keep users@ovirt.org CC'd when you reply. Gluster storage is very
popular on this mailing list, and you may get useful help from other users.
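For example, a first round of checks on each node (a sketch; the volume
names are taken from your mount points):

    $ gluster volume status
    $ gluster volume heal engine info summary
    $ gluster volume heal data info summary
    $ gluster volume heal vmstore info summary

Bricks shown as not online in the status output, or entries that stay
pending in the heal summaries, would match the delta_renew timeouts that
sanlock is logging.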
Nir