On Thu, Jun 2, 2022 at 7:44 PM Nir Soffer <nsof...@redhat.com> wrote:
>
> On Thu, Jun 2, 2022 at 7:14 PM Patrick Hibbs <hibbsncc1...@gmail.com> wrote:
> >
> > OK, so the data storage domain on a cluster filled up to the point that
> > the OS refused to allocate any more space.
> >
> > This happened because I tried to create a new prealloc'd disk from the
> > Admin WebUI. The disk creation claimed to complete successfully (I've
> > not tried to use that disk yet), but due to a timeout with the storage
> > domain in question the engine began trying to fence all of the HA VMs.
> > The fencing failed for all of the HA VMs, leaving them in a powered-off
> > state, despite all of the HA VMs being up at the time, so no
> > reallocation of the leases should have been necessary.
>
> Leases are not reallocated during fencing; I'm not sure why you expect
> this to happen.
>
> > Attempting to
> > restart them manually from the Admin WebUI failed, with the original
> > host they were running on complaining about "no space left on device"
> > and the other hosts claiming that the original host still held the VM
> > lease.
>
> "No space left on device" may be an unfortunate error from sanlock,
> meaning that there is no lockspace. This means the host is having trouble
> adding the lockspace, or has not finished adding it yet.
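>
> To check whether the host is connected to the lockspace, you can look for
> the storage domain UUID in sanlock's status output (in oVirt, the
> lockspace name is the storage domain UUID). A minimal sketch, assuming
> the sanlock CLI is installed on the host; SD_UUID is a hypothetical
> placeholder for the affected domain:
>
>     import subprocess
>
>     SD_UUID = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'  # placeholder
>
>     # typically requires root on the host
>     status = subprocess.run(
>         ['sanlock', 'client', 'status'],
>         capture_output=True, text=True, check=True,
>     ).stdout
>
>     if SD_UUID not in status:
>         # no lockspace for this domain on this host: acquiring a lease
>         # here will fail, often surfacing as "No space left on device"
>         print('host not connected to lockspace of', SD_UUID)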
>
> > After cleaning up some old snapshots, the HA VMs would still not boot.
> > Toggling the High Availability setting for each one and allowing the
> > lease to be removed from the storage domain was required to get the VMs
> > to start again.
>
> If you know that the VM is not running, disabling the lease temporarily is
> a good way to work around the issue.
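>
> For example, you can toggle the lease from a script with
> ovirt-engine-sdk-python instead of the Admin WebUI. A minimal sketch; the
> engine URL, credentials, VM name, and storage domain ID are hypothetical
> placeholders, and I have not verified on every engine version that an
> empty StorageDomainLease clears the lease:
>
>     import ovirtsdk4 as sdk
>     import ovirtsdk4.types as types
>
>     connection = sdk.Connection(
>         url='https://engine.example.com/ovirt-engine/api',  # placeholder
>         username='admin@internal',
>         password='...',  # placeholder
>         ca_file='ca.pem',
>     )
>     vms_service = connection.system_service().vms_service()
>     vm = vms_service.list(search='name=myvm')[0]  # placeholder name
>     vm_service = vms_service.vm_service(vm.id)
>
>     # Remove the lease (assumption: an empty lease clears it, like
>     # unchecking the HA lease storage domain in the WebUI):
>     vm_service.update(types.Vm(lease=types.StorageDomainLease()))
>
>     # Re-add the lease on the data storage domain:
>     vm_service.update(
>         types.Vm(
>             lease=types.StorageDomainLease(
>                 storage_domain=types.StorageDomain(id='SD-UUID'),  # placeholder
>             )
>         )
>     )
>     connection.close()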
>
> > Re-enabling the High Availability setting thereafter
> > fixed the lease issue. But now some, not all, of the HA VMs are still
> > throwing "no space left on device" errors when attempting to start
> > them. The others are working just fine even with their HA lease
> > enabled.
>
> Do all the errors come from the same host(s), or do some VMs fail to
> start while others start fine on the same host?
>
> > My questions are:
> >
> > 1. Why does oVirt claim to have a constantly allocated HA VM lease on
> > the storage domain when it's clearly only done while the VM is running?
>
> Leases are allocated when a VM is created. This allocates the lease space
> (1 MiB) in the external leases special volume and binds it to the VM ID.
>
> When the VM starts, it acquires the lease for its VM ID. If sanlock is not
> connected to the lockspace on this host, this may fail with the confusing
> "No space left on device" error.
>
> > 2. Why does oVirt deallocate the HA VM lease when performing a fencing
> > operation?
>
> It does not. oVirt does not actually "fence" the VM. If the host running
> the VM cannot access storage and update the lease, the host loses all
> leases on that storage. The result is that all VMs holding a lease on
> that storage are paused.
>
> oVirt will try to start the VM on another host, which will try to acquire
> the lease again on the new host. If enough time has passed since the
> original host lost access to storage, the lease can be acquired on the
> new host. If not, this will happen on one of the next retries.
>
> If the original host did not lose access to storage and is still updating
> the lease, you cannot acquire the lease from another host. This protects
> the VM from a split-brain that would corrupt the VM disk.
>
> > 3. Why can't oVirt clear the old HA VM lease when the VM is down and
> > the storage pool has space available? (How much space is even needed?
> > The leases section of the storage domain in the Admin WebUI doesn't
> > contain any useful info beyond the fact that a lease should exist for a
> > VM even when it's off.)
>
> Acquiring the lease is possible only if the lease is not held on another host.
>
> oVirt does not support acquiring a held lease by killing the process
> holding the lease on another host, but sanlock provides such a capability.
>
> > 4. Is there a better way to force start a HA VM when the lease is old
> > and the VM is powered off?
>
> If the original VM has been powered off for long enough (2-3 minutes),
> the lease expires and starting the VM on another host should succeed.
>
> > 5. Should I file a bug on the whole HA VM failing to reacquire a lease
> > on a full storage pool?
>
> The external lease volume is not fully allocated. If you use thin
> provisioned storage, and there is really no storage space, it is possible
> that creating a new lease will fail, but starting and stopping VMs that
> have leases should not be affected. But if you reach the point where you
> don't have enough storage space, you have much bigger trouble and should
> fix it urgently.
>
> Do you really have an issue with available space? What does the engine
> report about the storage domain? What does the underlying storage report?
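>
> You can query what the engine reports with the Python SDK. A minimal,
> self-contained sketch with the same hypothetical placeholders as above
> (the available/used/committed attributes are reported in bytes):
>
>     import ovirtsdk4 as sdk
>
>     connection = sdk.Connection(
>         url='https://engine.example.com/ovirt-engine/api',  # placeholder
>         username='admin@internal',
>         password='...',  # placeholder
>         ca_file='ca.pem',
>     )
>     sds_service = connection.system_service().storage_domains_service()
>     sd = sds_service.list(search='name=mydata')[0]  # placeholder name
>
>     GiB = 1024 ** 3
>     # committed is the space promised to provisioned disks, which can
>     # exceed what is physically used on thin storage
>     print('available:', sd.available / GiB, 'GiB')
>     print('used:', sd.used / GiB, 'GiB')
>     print('committed:', sd.committed / GiB, 'GiB')
>     connection.close()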

I forgot to answer the question about filing a bug - yes. This mail does not
have enough info to understand the issue. Please file a bug describing
what you experienced, and what you expect to happen. Attach engine and
vdsm logs from all hosts showing the relevant timeframe.

Nir