Here are the ausearch results from that host. It looks like more than one
issue. (openvswitch is also in there.)

I'll see about opening the bug. Should I file it on oVirt's GitHub or
the Red Hat Bugzilla?

-Patrick Hibbs

On Thu, 2022-06-02 at 22:08 +0300, Nir Soffer wrote:
> On Thu, Jun 2, 2022 at 9:52 PM Patrick Hibbs <hibbsncc1...@gmail.com>
> wrote:
> > 
> > The attached logs are from the cluster hosts that were running the HA
> > VMs during the failures.
> > 
> > I've finally got all of my HA VMs up again. The last one didn't start
> > again until after I freed up more space in the storage domain than what
> > was originally available when the VM was running previously. (It now
> > has over 150GB of free space, which should be more than enough, but it
> > didn't boot with 140GB available....)
> > 
> > SideNote:
> > I just found this in the logs on the original host that the HA VMs were
> > running on:
> > 
> > ---snip---
> > Jun 02 10:33:29 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:29 674607 [1054]: s1 check_our_lease warning 71 last_success 674536
> >     # semanage fcontext -a -t virt_image_t '1055'
> >     *****  Plugin catchall (2.13 confidence) suggests   **************************
> >     Then you should report this as a bug.
> >     You can generate a local policy module to allow this access.
> >     Do
> 
> Not clear what the SELinux issue is. If you run:
> 
>     ausearch -m avc
> 
> it should be clearer.
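> 
> If the denials turn out to be the problem, a possible follow-up (a rough
> sketch; "my-sanlock-local" is just a hypothetical module name) is to let
> audit2allow build a local policy module from those denials, as the
> setroubleshoot output above suggests:
> 
>     ausearch -m avc -ts today --raw | audit2allow -M my-sanlock-local
>     semodule -i my-sanlock-local.pp
> 
> But if it looks like a bug in the shipped policy, reporting it is better
> than papering over it with a local module.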
> 
> > Jun 02 10:33:45 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:45 674623 [1054]: s1 kill 3441 sig 15 count 8
> > Jun 02 10:33:45 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:45 674623 [1054]: s1 kill 4337 sig 15 count 8
> > Jun 02 10:33:46 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:46 674624 [1054]: s1 kill 3206 sig 15 count 9
> 
> This means that the host could not access the storage for 80 seconds and
> the leases expired. When leases expire, sanlock must kill the processes
> holding the leases. Here we see that sanlock sent SIGTERM to 3 processes.
> 
> If these are VMs, they will pause and libvirt will release the lease.
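> 
> To see what sanlock is tracking on that host - which lockspaces it has
> joined, which resources (leases) it holds, and which pids hold them -
> you can run something like:
> 
>     sanlock client status
> 
> The "s" lines are lockspaces, "r" lines are resources, and "p" lines are
> the registered processes (the pids sanlock killed above should show up
> there while they hold leases).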
> 
> I can check the log deeper next week.
> 
> Nir
> 
> > Jun 02 10:33:47 ryuki.codenet kernel: ovirtmgmt: port 4(vnet2) entered disabled state
> > ---snip---
> > 
> > That looks like some SELinux failure.
> > 
> > -Patrick Hibbs
> > 
> > On Thu, 2022-06-02 at 19:44 +0300, Nir Soffer wrote:
> > > On Thu, Jun 2, 2022 at 7:14 PM Patrick Hibbs
> > > <hibbsncc1...@gmail.com>
> > > wrote:
> > > > 
> > > > OK, so the data storage domain on a cluster filled up to the point
> > > > that the OS refused to allocate any more space.
> > > > 
> > > > This happened because I tried to create a new prealloc'd disk from the
> > > > Admin WebUI. The disk creation claimed to complete successfully (I've
> > > > not tried to use that disk yet), but due to a timeout with the storage
> > > > domain in question the engine began trying to fence all of the HA VMs.
> > > > The fencing failed for all of the HA VMs, leaving them in a powered-off
> > > > state, even though all of the HA VMs were up at the time, so no
> > > > reallocation of the leases should have been necessary.
> > > 
> > > Leases are not reallocated during fencing; I'm not sure why you expect
> > > this to happen.
> > > 
> > > > Attempting to restart them manually from the Admin WebUI failed, with
> > > > the original host they were running on complaining about "no space
> > > > left on device" and the other hosts claiming that the original host
> > > > still held the VM lease.
> > > 
> > > "No space left on device" may be an unfortunate error from sanlock,
> > > meaning that there is no lockspace. This means the host has trouble
> > > adding the lockspace, or adding it has not completed yet.
> > > 
> > > > After cleaning up some old snapshots, the HA VMs would still not boot.
> > > > Toggling the High Availability setting for each one and allowing the
> > > > lease to be removed from the storage domain was required to get the
> > > > VMs to start again.
> > > 
> > > If you know that the VM is not running, disabling the lease temporarily
> > > is a good way to work around the issue.
> > > 
> > > > Re-enabling the High Availability setting thereafter fixed the lease
> > > > issue. But now some, not all, of the HA VMs are still throwing "no
> > > > space left on device" errors when attempting to start them. The others
> > > > are working just fine even with their HA lease enabled.
> > > 
> > > Do all the errors come from the same host(s), or do some VMs fail to
> > > start while others can start on the same host?
> > > 
> > > > My questions are:
> > > > 
> > > > 1. Why does oVirt claim to have a constantly allocated HA VM lease on
> > > > the storage domain when it's clearly only held while the VM is running?
> > > 
> > > Leases are allocated when a VM is created. This allocates the lease
> > > space (1 MiB) in the external leases special volume and binds it to the
> > > VM ID.
> > > 
> > > When the VM starts, it acquires the lease for its VM ID. If sanlock is
> > > not connected to the lockspace on this host, this may fail with the
> > > confusing "No space left on device" error.
> > > 
> > > > 2. Why does oVirt deallocate the HA VM lease when performing a fencing
> > > > operation?
> > > 
> > > It does not. oVirt does not actually "fence" the VM. If the host running
> > > the VM cannot access storage and update the lease, the host loses all
> > > leases on that storage. The result is pausing all the VMs holding a
> > > lease on that storage.
> > > 
> > > oVirt will try to start the VM on another host, which will try to
> > > acquire the lease again on the new host. If enough time has passed since
> > > the original host lost access to storage, the lease can be acquired on
> > > the new host. If not, this will happen in one of the next retries.
> > > 
> > > If the original host did not lose access to storage and it is still
> > > updating the lease, you cannot acquire the lease from another host. This
> > > protects the VM from a split-brain that would corrupt the VM disk.
> > > 
> > > > 3. Why can't oVirt clear the old HA VM lease when the VM is down and
> > > > the storage pool has space available? (How much space is even needed?
> > > > The leases section of the storage domain in the Admin WebUI doesn't
> > > > contain any useful info beyond the fact that a lease should exist for
> > > > a VM even when it's off.)
> > > 
> > > Acquiring the lease is possible only if the lease is not held on another
> > > host.
> > > 
> > > oVirt does not support acquiring a held lease by killing the process
> > > holding the lease on another host, but sanlock provides such a
> > > capability.
> > > 
> > > > 4. Is there a better way to force start an HA VM when the lease is old
> > > > and the VM is powered off?
> > > 
> > > If the original VM is powered off for enough time (2-3 minutes), the
> > > lease expires and starting the VM on another host should succeed.
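> > > 
> > > A quick way to sanity-check this before retrying (just a sketch) is to
> > > look at sanlock on the original host:
> > > 
> > >     sanlock client status
> > >     tail /var/log/sanlock.log
> > > 
> > > If the VM's resource is no longer listed there, and any renewal errors
> > > in the log stopped long enough ago, the start on another host should go
> > > through.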
> > > 
> > > > 5. Should I file a bug about the HA VMs failing to reacquire a lease
> > > > on a full storage pool?
> > > 
> > > The external lease volume is not fully allocated. If you use thin
> > > provisioned storage and there is really no storage space left, it is
> > > possible that creating a new lease will fail, but starting and stopping
> > > VMs that have leases should not be affected. But if you reach the point
> > > where you don't have enough storage space, you have much bigger trouble
> > > and should fix it urgently.
> > > 
> > > Do you really have an issue with available space? What does the engine
> > > report about the storage domain? What does the underlying storage
> > > report?
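> > > 
> > > For a quick host-side check (a generic sketch - paths and names depend
> > > on your setup), compare what the engine shows with what the storage
> > > itself reports:
> > > 
> > >     # file (NFS/Gluster) domain: free space on the mount
> > >     df -h /rhev/data-center/mnt/<server:_export>
> > >     # block (iSCSI/FC) domain: free space in the domain's VG
> > >     vgs --units g <storage-domain-uuid>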
> > > 
> > > Nir
> > > 
> > 
> 

Attachment: ryuki.ausearch.log.xz
Description: application/xz
