On Thu, Jun 2, 2022 at 7:44 PM Nir Soffer <nsof...@redhat.com> wrote:
>
> On Thu, Jun 2, 2022 at 7:14 PM Patrick Hibbs <hibbsncc1...@gmail.com> wrote:
> >
> > OK, so the data storage domain on a cluster filled up to the point that
> > the OS refused to allocate any more space.
> >
> > This happened because I tried to create a new prealloc'd disk from the
> > Admin WebUI. The disk creation claims to be completed successfully,
> > I've not tried to use that disk yet, but due to a timeout with the
> > storage domain in question the engine began trying to fence all of the
> > HA VMs.
> > The fencing failed for all of the HA VMs leaving them in a powered off
> > state. Despite all of the HA VMs being up at the time, so no
> > reallocation of the leases should have been necessary.
>
> Leases are not reallocated during fencing, not sure why you expect
> this to happen.
>
> > Attempting to
> > restart them manually from the Admin WebUI failed. With the original
> > host they were running on complaining about "no space left on device",
> > and the other hosts claiming that the original host still held the VM
> > lease.
>
> "No space left on device" may be an unfortunate error from sanlock,
> meaning that there is no lockspace. This means the host has trouble
> adding the lockspace, or adding it did not complete yet.
>
> > After cleaning up some old snapshots, the HA VMs would still not boot.
> > Toggling the High Availability setting for each one and allowing the
> > lease to be removed from the storage domain was required to get the VMs
> > to start again.
>
> If you know that the VM is not running, disabling the lease temporarily is
> a good way to work around the issue.
>
> > Re-enabling the High Availability setting thereafter
> > fixed the lease issue. But now some, not all, of the HA VMs are still
> > throwing "no space left on device" errors when attempting to start
> > them. The others are working just fine even with their HA lease
> > enabled.
>
> Do all errors come from the same host(s), or can some VMs not start while
> others can on the same host?
>
> > My questions are:
> >
> > 1. Why does oVirt claim to have a constantly allocated HA VM lease on
> > the storage domain when it's clearly only done while the VM is running?
>
> Leases are allocated when a VM is created. This allocates the lease space
> (1 MiB) in the special external leases volume and binds it to the VM ID.
>
> When a VM starts, it acquires the lease for its VM ID. If sanlock is not
> connected to the lockspace on this host, this may fail with the confusing
> "No space left on device" error.
>
> > 2. Why does oVirt deallocate the HA VM lease when performing a fencing
> > operation?
>
> It does not. oVirt does not actually "fence" the VM. If the host running the
> VM cannot access storage and update the lease, the host loses all leases on
> that storage. The result is pausing all VMs holding a lease on that storage.
>
> oVirt will try to start the VM on another host, which will try to
> acquire the lease again on the new host. If enough time has passed since the
> original host lost access to storage, the lease can be acquired on the new
> host. If not, this will happen on one of the next retries.
>
> If the original host did not lose access to storage, and it is still
> updating the lease, you cannot acquire the lease from another host. This
> protects the VM from a split-brain that would corrupt the VM disk.
>
> > 3. Why can't oVirt clear the old HA VM lease when the VM is down and
> > the storage pool has space available? (How much space is even needed?
> > The leases section of the storage domain in the Admin WebUI doesn't
> > contain any useful info beyond the fact that a lease should exist for a
> > VM even when it's off.)
>
> Acquiring the lease is possible only if the lease is not held on another host.
>
> oVirt does not support acquiring a held lease by killing the process holding
> the lease on another host, but sanlock provides such a capability.
>
> > 4. Is there a better way to force start a HA VM when the lease is old
> > and the VM is powered off?
>
> If the original VM is powered off for enough time (2-3 minutes), the lease
> expires and starting the VM on another host should succeed.
>
> > 5. Should I file a bug on the whole HA VM failing to reacquire a lease
> > on a full storage pool?
>
> The external lease volume is not fully allocated. If you use thin provisioned
> storage, and there is really no storage space, it is possible that
> creating a new lease will fail, but starting and stopping VMs that have
> leases should not be affected. But if you reach the point where you don't
> have enough storage space, you have much bigger trouble that you should fix
> urgently.
>
> Do you really have an issue with available space? What does the engine report
> about the storage domain? What does the underlying storage report?
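
To make the lease behavior described above easier to follow, here is a minimal
Python sketch of the rules as stated in this thread. It is a toy model, not
sanlock's or vdsm's actual code: the class names, the `acquire` function, the
`LeaseHeldError` exception, and the 180-second expiry are all assumptions made
for illustration (the thread only says "2-3 minutes").

```python
import errno
from dataclasses import dataclass, field
from typing import Optional, Set

# Assumed value for illustration only; the thread says "2-3 minutes".
LEASE_EXPIRY_SECONDS = 180.0


class LeaseHeldError(Exception):
    """Hypothetical error: another host still holds a live lease."""


@dataclass
class Lease:
    """Toy model of one VM lease; not sanlock's real on-disk format."""
    vm_id: str
    owner: Optional[str] = None   # name of the host holding the lease
    last_renewal: float = 0.0     # time of the owner's last renewal


@dataclass
class Host:
    name: str
    lockspaces: Set[str] = field(default_factory=set)  # joined domains


def acquire(lease: Lease, host: Host, domain: str, now: float) -> None:
    """Try to acquire `lease` for `host` at time `now` (in seconds)."""
    if domain not in host.lockspaces:
        # The confusing case from the thread: the host has not joined the
        # lockspace, but the error reads "No space left on device".
        raise OSError(errno.ENOSPC, "No space left on device")

    held_elsewhere = lease.owner is not None and lease.owner != host.name
    if held_elsewhere and now - lease.last_renewal < LEASE_EXPIRY_SECONDS:
        # The owner is still renewing, so refusing here is what prevents
        # two hosts from running the same VM (split-brain).
        raise LeaseHeldError(f"lease for {lease.vm_id} held by {lease.owner}")

    # Lease is free, already ours, or the previous owner's lease expired.
    lease.owner = host.name
    lease.last_renewal = now
```

In this model, a second host calling `acquire` 60 seconds after the owner's
last renewal gets `LeaseHeldError`, while the same call at 200 seconds
succeeds. A host that never joined the lockspace gets `ENOSPC` regardless of
how much free space the storage has.
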

I forgot to answer the question about filing a bug - yes.

This mail does not have enough info to understand the issue. Please file a
bug describing what you experienced, and what you expect to happen. Attach
engine and vdsm logs from all hosts showing the relevant timeframe.

Nir