On Thu, Jun 2, 2022 at 7:14 PM Patrick Hibbs <hibbsncc1...@gmail.com> wrote:
> OK, so the data storage domain on a cluster filled up to the point that
> the OS refused to allocate any more space.
> This happened because I tried to create a new prealloc'd disk from the
> Admin WebUI. The disk creation claims to be completed successfully,
> I've not tried to use that disk yet, but due to a timeout with the
> storage domain in question the engine began trying to fence all of the
> HA VMs.
> The fencing failed for all of the HA VMs leaving them in a powered off
> state. Despite all of the HA VMs being up at the time, so no
> reallocation of the leases should have been necessary.

Leases are not reallocated during fencing; I am not sure why you expect
this to happen.

> Attempting to
> restart them manually from the Admin WebUI failed. With the original
> host they were running on complaining about "no space left on device",
> and the other hosts claiming that the original host still held the VM
> lease.

"No space left on device" may be an unfortunate error from sanlock,
meaning that there is no lockspace. This means the host has trouble
adding the lockspace, or that adding it has not completed yet.

> After cleaning up some old snapshots, the HA VMs would still not boot.
> Toggling the High Availability setting for each one and allowing the
> lease to be removed from the storage domain was required to get the VMs
> to start again.

If you know that the VM is not running, disabling the lease temporarily is
a good way to work around the issue.

> Re-enabling the High Availability setting there after
> fixed the lease issue. But now some, not all, of the HA VMs are still
> throwing "no space left on device" errors when attempting to start
> them. The others are working just fine even with their HA lease
> enabled.

Do all the errors come from the same host(s), or can some VMs not start
while others can on the same host?

> My questions are:
> 1. Why does oVirt claim to have a constantly allocated HA VM lease on
> the storage domain when it's clearly only done while the VM is running?

Leases are allocated when a VM is created. This allocates the lease space
(1 MiB) in the special external leases volume and binds it to the VM ID.

When a VM starts, it acquires the lease for its VM ID. If sanlock is not
connected to the lockspace on this host, this may fail with the confusing
"No space left on device" error.
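
The wording comes from the standard text for errno 28 (ENOSPC), which is
what ends up in the error message. A minimal Python sketch of that mapping
follows. explain_lease_error is a hypothetical helper, not part of sanlock or
vdsm, and treating ENOSPC as "no lockspace" is an assumption taken from the
explanation above:

```python
import errno
import os


def explain_lease_error(err):
    """Hypothetical helper: annotate an errno seen while acquiring a VM lease."""
    text = os.strerror(err)
    if err == errno.ENOSPC:
        # Assumption from the explanation above: in this context ENOSPC
        # can mean "no lockspace on this host", not a full disk.
        return text + " (possibly: sanlock has not joined the lockspace)"
    return text


print(explain_lease_error(errno.ENOSPC))
```

So the same message can appear even when the storage domain has free space.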

> 2. Why does oVirt deallocate the HA VM lease when performing a fencing
> operation?

It does not. oVirt does not actually "fence" the VM. If the host running the VM
cannot access storage and update the lease, the host loses all leases on that
storage. The result is that all VMs holding a lease on that storage are paused.

oVirt will try to start the VM on another host, which will try to acquire
the lease again on the new host. If enough time has passed since the original
host lost access to storage, the lease can be acquired on the new host. If not,
this will happen in one of the next retries.

If the original host did not lose access to storage and is still updating
the lease, you cannot acquire the lease from another host. This protects the
VM from a split-brain that would corrupt the VM disk.
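
The rule described above can be sketched as a toy model in Python. This is
not sanlock's actual algorithm; the Lease class, try_acquire and the
180-second expiry are assumptions made up for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner: Optional[str]   # host currently holding the lease, or None
    last_renewal: float    # time of the owner's last successful renewal (s)


def try_acquire(lease, host, now, expiry):
    """Toy rule: take the lease only if it is free or the owner stopped renewing."""
    if lease.owner is None or lease.owner == host:
        lease.owner, lease.last_renewal = host, now
        return True
    if now - lease.last_renewal >= expiry:
        # The owner has not renewed for long enough; the lease has expired.
        lease.owner, lease.last_renewal = host, now
        return True
    # Still held and renewed elsewhere: refuse, which avoids split-brain.
    return False


lease = Lease(owner="host1", last_renewal=0.0)
print(try_acquire(lease, "host2", now=60.0, expiry=180.0))   # too early
print(try_acquire(lease, "host2", now=200.0, expiry=180.0))  # after expiry
```

In this model a retry that comes too early simply fails and a later retry
succeeds, which matches the behavior described above.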

> 3. Why can't oVirt clear the old HA VM lease when the VM is down and
> the storage pool has space available? (How much space is even needed?
> The leases section of the storage domain in the Admin WebUI doesn't
> contain any useful info beyond the fact that a lease should exist for a
> VM even when it's off.)

Acquiring the lease is possible only if the lease is not held on another host.

oVirt does not support acquiring a held lease by killing the process holding
the lease on another host, but sanlock provides such capability.

> 4. Is there a better way to force start a HA VM when the lease is old
> and the VM is powered off?

If the original VM has been powered off for long enough (2-3 minutes), the
lease expires and starting the VM on another host should succeed.

> 5. Should I file a bug on the whole HA VM failing to reacquire a lease
> on a full storage pool?

The external lease volume is not fully allocated. If you use thin provisioned
storage and there is really no storage space, it is possible that creating
a new lease will fail, but starting and stopping VMs that have leases should not
be affected. But if you reach the point where you don't have enough storage
space, you have much bigger trouble that you should fix urgently.

Do you really have an issue with available space? What does engine report
about the storage domain? What does the underlying storage report?
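
For a file-based storage domain mounted on a host, one rough way to answer
the last question is to ask the filesystem directly. A minimal sketch,
assuming Python is available on the host; usage_report is a hypothetical
helper, "/" is only a placeholder path, and block-based domains need
different tools:

```python
import shutil


def usage_report(path):
    """Return (free_gib, percent_used) for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.free / gib, 100.0 * usage.used / usage.total


# Placeholder path; replace with the real mount point of the storage domain.
free_gib, pct = usage_report("/")
print(f"free: {free_gib:.1f} GiB, used: {pct:.1f}%")
```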
