Here are the ausearch results from that host. It looks like there's more than one issue. (openvswitch is also in there.)
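For reference, something along these lines should reproduce the attached
results, and audit2allow can turn the denials into a local policy module.
(A rough sketch; it assumes audit2allow from the policycoreutils tools is
installed, and 'local_ovirt' is just an arbitrary module name.)

    # Show recent AVC denials in a readable form
    ausearch -m avc -ts recent

    # Build (but do not install yet) a local policy module from them
    ausearch -m avc -ts recent | audit2allow -M local_ovirt

    # Review local_ovirt.te first, then load the module
    semodule -i local_ovirt.pp

I'd review the generated local_ovirt.te before loading anything, since
blindly allowing a denial can paper over a real bug.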
I'll see about opening the bug. Should I file it on oVirt's GitHub or the
Red Hat Bugzilla?

-Patrick Hibbs

On Thu, 2022-06-02 at 22:08 +0300, Nir Soffer wrote:
> On Thu, Jun 2, 2022 at 9:52 PM Patrick Hibbs <hibbsncc1...@gmail.com>
> wrote:
> >
> > The attached logs are from the cluster hosts that were running the
> > HA VMs during the failures.
> >
> > I've finally got all of my HA VMs up again. The last one didn't
> > start again until after I freed up more space in the storage domain
> > than what was originally available when the VM was running
> > previously. (It now has over 150GB of free space, which should be
> > more than enough, but it didn't boot with 140GB available....)
> >
> > SideNote:
> > I just found this in the logs on the original host that the HA VMs
> > were running on:
> >
> > ---snip---
> > Jun 02 10:33:29 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:29 674607
> > [1054]: s1 check_our_lease warning 71 last_success 674536
> > # semanage fcontext -a -t virt_image_t '1055'
> > ***** Plugin catchall (2.13 confidence) suggests **************************
> > Then you should report this as a bug.
> > You can generate a local policy module to allow this access.
> > Do
>
> Not clear what the selinux issue is. If you run:
>
>     ausearch -m avc
>
> it should be more clear.
>
> > Jun 02 10:33:45 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:45 674623
> > [1054]: s1 kill 3441 sig 15 count 8
> > Jun 02 10:33:45 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:45 674623
> > [1054]: s1 kill 4337 sig 15 count 8
> > Jun 02 10:33:46 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:46 674624
> > [1054]: s1 kill 3206 sig 15 count 9
>
> This means that the host could not access the storage for 80 seconds,
> and the leases expired. When leases expire, sanlock must kill the
> processes holding them. Here we see that sanlock sent SIGTERM to 3
> processes.
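>
> To make the timing concrete, this is just the default sanlock math
> (assuming the default 10 second io timeout): a lease expires after 8
> io timeouts without a successful renewal, i.e. 8 * 10 = 80 seconds.
> That matches your log: the warning at 674607 shows last_success
> 674536, i.e. 71 seconds without a renewal, and the kill messages
> start at 674623, 87 seconds after the last successful renewal.
>
> You can see the lockspaces and the leases currently held on a host
> with:
>
>     sanlock client status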
>
> If these are VMs, they will pause and libvirt will release the lease.
>
> I can check the log deeper next week.
>
> Nir
>
> > Jun 02 10:33:47 ryuki.codenet kernel: ovirtmgmt: port 4(vnet2)
> > entered disabled state
> > ---snip---
> >
> > That looks like some SELinux failure.
> >
> > -Patrick Hibbs
> >
> > On Thu, 2022-06-02 at 19:44 +0300, Nir Soffer wrote:
> > > On Thu, Jun 2, 2022 at 7:14 PM Patrick Hibbs
> > > <hibbsncc1...@gmail.com> wrote:
> > > >
> > > > OK, so the data storage domain on a cluster filled up to the
> > > > point that the OS refused to allocate any more space.
> > > >
> > > > This happened because I tried to create a new prealloc'd disk
> > > > from the Admin WebUI. The disk creation claimed to complete
> > > > successfully (I've not tried to use that disk yet), but due to
> > > > a timeout with the storage domain in question the engine began
> > > > trying to fence all of the HA VMs.
> > > > The fencing failed for all of the HA VMs, leaving them in a
> > > > powered-off state. All of the HA VMs were up at the time, so no
> > > > reallocation of the leases should have been necessary.
> > >
> > > Leases are not reallocated during fencing; I'm not sure why you
> > > expect this to happen.
> > >
> > > > Attempting to restart them manually from the Admin WebUI
> > > > failed, with the original host they were running on complaining
> > > > about "no space left on device", and the other hosts claiming
> > > > that the original host still held the VM lease.
> > >
> > > "No space left on device" may be an unfortunate error from
> > > sanlock, meaning that there is no lockspace. This means the host
> > > has trouble adding the lockspace, or adding it has not completed
> > > yet.
> > >
> > > > After cleaning up some old snapshots, the HA VMs would still
> > > > not boot. Toggling the High Availability setting for each one
> > > > and allowing the lease to be removed from the storage domain
> > > > was required to get the VMs to start again.
> > >
> > > If you know that the VM is not running, disabling the lease
> > > temporarily is a good way to work around the issue.
> > >
> > > > Re-enabling the High Availability setting thereafter fixed the
> > > > lease issue. But now some, not all, of the HA VMs are still
> > > > throwing "no space left on device" errors when attempting to
> > > > start them. The others are working just fine even with their HA
> > > > lease enabled.
> > >
> > > Do all the errors come from the same host(s), or do some VMs fail
> > > to start while others can start on the same host?
> > >
> > > > My questions are:
> > > >
> > > > 1. Why does oVirt claim to have a constantly allocated HA VM
> > > > lease on the storage domain when it's clearly only done while
> > > > the VM is running?
> > >
> > > Leases are allocated when a VM is created. This allocates the
> > > lease space (1MiB) in the external leases special volume and
> > > binds it to the VM ID.
> > >
> > > When a VM starts, it acquires the lease for its VM ID. If sanlock
> > > is not connected to the lockspace on this host, this may fail
> > > with the confusing "No space left on device" error.
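> > >
> > > If you want to see what is actually stored there, sanlock can
> > > dump the external leases volume directly. A sketch; the path
> > > below is an example for a file based storage domain and has to be
> > > adjusted to your domain (on block storage it is the "xleases"
> > > logical volume):
> > >
> > >     sanlock direct dump /rhev/data-center/mnt/<server:_export>/<sd_uuid>/dom_md/xleases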
> > >
> > > > 2. Why does oVirt deallocate the HA VM lease when performing a
> > > > fencing operation?
> > >
> > > It does not. oVirt does not actually "fence" the VM. If the host
> > > running the VM cannot access the storage and update the lease,
> > > the host loses all leases on that storage. The result is pausing
> > > all the VMs holding a lease on that storage.
> > >
> > > oVirt will try to start the VM on another host, which will try to
> > > acquire the lease again on the new host. If enough time has
> > > passed since the original host lost access to storage, the lease
> > > can be acquired on the new host. If not, this will happen in the
> > > next retry (or retries).
> > >
> > > If the original host did not lose access to storage, and it is
> > > still updating the lease, you cannot acquire the lease from
> > > another host. This protects the VM from a split brain that would
> > > corrupt the VM disk.
> > >
> > > > 3. Why can't oVirt clear the old HA VM lease when the VM is
> > > > down and the storage pool has space available? (How much space
> > > > is even needed? The leases section of the storage domain in the
> > > > Admin WebUI doesn't contain any useful info beyond the fact
> > > > that a lease should exist for a VM even when it's off.)
> > >
> > > Acquiring the lease is possible only if the lease is not held on
> > > another host.
> > >
> > > oVirt does not support acquiring a held lease by killing the
> > > process holding the lease on another host, but sanlock provides
> > > such a capability.
> > >
> > > > 4. Is there a better way to force start a HA VM when the lease
> > > > is old and the VM is powered off?
> > >
> > > If the original VM is powered off for long enough (2-3 minutes),
> > > the lease expires and starting the VM on another host should
> > > succeed.
> > >
> > > > 5. Should I file a bug on the whole HA VM failing to reacquire
> > > > a lease on a full storage pool?
> > >
> > > The external lease volume is not fully allocated. If you use thin
> > > provisioned storage, and there really is no storage space, it is
> > > possible that creating a new lease will fail, but starting and
> > > stopping VMs that already have leases should not be affected. But
> > > if you reach the point where you don't have enough storage space,
> > > you have much bigger trouble and should fix it urgently.
> > >
> > > Do you really have an issue with available space? What does the
> > > engine report about the storage domain? What does the underlying
> > > storage report?
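> > >
> > > Besides the Admin UI, you can ask the engine directly over the
> > > REST API. A sketch; the engine FQDN and the credentials are
> > > placeholders you need to adjust:
> > >
> > >     curl -s -k -u 'admin@internal:PASSWORD' \
> > >         -H 'Accept: application/xml' \
> > >         'https://engine.example.com/ovirt-engine/api/storagedomains' |
> > >         grep -E '<name>|<available>|<used>'
> > >
> > > The <available> and <used> values are reported in bytes, so they
> > > can be compared directly with what the underlying storage
> > > reports.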
> > >
> > > Nir

Attachment: ryuki.ausearch.log.xz (application/xz)
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/H2QQD4EEC7HBD7YHWH4O333H7QQUSA63/