On Thu, Jun 2, 2022 at 10:33 PM Patrick Hibbs <hibbsncc1...@gmail.com> wrote:
>
> Here's the ausearch results from that host. Looks like more than one
> issue. (openvswitch is also in there.)

I did not see anything related to the issues you reported, and SELinux
is likely not related. However, there are unexpected denials that
may be harmless but should not appear in the report.

I think filing a separate bug for the two kinds of denials there makes
sense; someone should check and fix either the SELinux policy or
the program that is trying to do something it should not.
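Until the policy is fixed, a temporary local policy module generated
from the denials is a common workaround (this is only a sketch using the
standard audit2allow workflow; the module name is arbitrary, and allowing
the access may hide a real bug, so treat it as temporary):

    # extract recent AVC denials and build a local policy module
    ausearch -m avc -ts recent | audit2allow -M local_ovirt_denials
    # review local_ovirt_denials.te before loading the module
    semodule -i local_ovirt_denials.pp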

I think this one should be reported against qemu-kvm in Bugzilla:

time->Thu Jun  2 10:33:38 2022
type=PROCTITLE msg=audit(1654180418.940:5119):
proctitle=2F7573722F6C6962657865632F71656D752D6B766D002D6E616D650067756573743D57656253657276696365735F486F6E6F6B612C64656275672D746872656164733D6F6E002D53002D6F626A656374007B22716F6D2D74797065223A22736563726574222C226964223A226D61737465724B657930222C22666F726D617422
type=SYSCALL msg=audit(1654180418.940:5119): arch=c000003e syscall=257
success=no exit=-13 a0=ffffff9c a1=5647b7ffd910 a2=0 a3=0 items=0
ppid=1 pid=3639 auid=4294967295 uid=107 gid=107 euid=107 suid=107
fsuid=107 egid=107 sgid=107 fsgid=107 tty=(none) ses=4294967295
comm="qemu-kvm" exe="/usr/libexec/qemu-kvm"
subj=system_u:system_r:svirt_t:s0:c9,c704 key=(null)
type=AVC msg=audit(1654180418.940:5119): avc:  denied  { search } for
pid=3639 comm="qemu-kvm" name="1055" dev="proc" ino=28142
scontext=system_u:system_r:svirt_t:s0:c9,c704
tcontext=system_u:system_r:sanlock_t:s0-s0:c0.c1023 tclass=dir
permissive=0
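
For context, the denied search is on /proc/1055 and the target context is
sanlock_t, so qemu-kvm apparently tried to look at a sanlock process's
/proc entry. If that process is still running, something like this should
confirm what it is (the pid is just the one taken from the log above):

    ps -Z -p 1055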

I'm not sure where this should be reported, maybe kernel?

type=SYSCALL msg=audit(1651812155.891:50): arch=c000003e syscall=175
success=yes exit=0 a0=55bcab394ed0 a1=51494 a2=55bca960b8b6
a3=55bcaab64010 items=0 ppid=1274 pid=1282 auid=4294967295 uid=0 gid=0
euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295
comm="modprobe" exe="/usr/bin/kmod"
subj=system_u:system_r:openvswitch_load_module_t:s0 key=(null)
type=AVC msg=audit(1651812155.891:50): avc:  denied  { search } for
pid=1282 comm="modprobe" name="events" dev="tracefs" ino=2060
scontext=system_u:system_r:openvswitch_load_module_t:s0
tcontext=system_u:object_r:tracefs_t:s0 tclass=dir permissive=0
type=AVC msg=audit(1651812155.891:50): avc:  denied  { search } for
pid=1282 comm="modprobe" name="events" dev="tracefs" ino=2060
scontext=system_u:system_r:openvswitch_load_module_t:s0
tcontext=system_u:object_r:tracefs_t:s0 tclass=dir permissive=0
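
If you want to check whether the installed policy already has a rule for
this before filing, sesearch (from the setools package, assuming it is
installed) can query it:

    # list allow rules from openvswitch_load_module_t to tracefs_t dirs
    sesearch -A -s openvswitch_load_module_t -t tracefs_t -c dir

If nothing comes back, the same audit2allow workflow above can be used as
a temporary local module until the policy is fixed.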

> I'll see about opening the bug. Should I file it on oVirt's github or
> the RedHat bugzilla?

Bugzilla is still the preferred place, but you can use GitHub if you like;
we will look at it in both places.

Nir

> -Patrick Hibbs
>
> On Thu, 2022-06-02 at 22:08 +0300, Nir Soffer wrote:
> > On Thu, Jun 2, 2022 at 9:52 PM Patrick Hibbs <hibbsncc1...@gmail.com>
> > wrote:
> > >
> > > The attached logs are from the cluster hosts that were running the
> > > HA VMs during the failures.
> > >
> > > I've finally got all of my HA VMs up again. The last one didn't start
> > > again until after I freed up more space in the storage domain than what
> > > was originally available when the VM was running previously. (It now
> > > has over 150GB of free space, which should be more than enough, but it
> > > didn't boot with 140GB available....)
> > >
> > > SideNote:
> > > I just found this in the logs on the original host that the HA VMs
> > > were running on:
> > >
> > > ---snip---
> > > Jun 02 10:33:29 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:29 674607
> > > [1054]: s1 check_our_lease warning 71 last_success 674536
> > >     # semanage fcontext -a -t virt_image_t '1055'
> > >     *****  Plugin catchall (2.13 confidence) suggests   **************************
> > >     Then you should report this as a bug.
> > >     You can generate a local policy module to allow this access.
> > >     Do
> >
> > It is not clear what the SELinux issue is. If you run:
> >
> >     ausearch -m avc
> >
> > it should be clearer.
> >
> > > Jun 02 10:33:45 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:45 674623 [1054]: s1 kill 3441 sig 15 count 8
> > > Jun 02 10:33:45 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:45 674623 [1054]: s1 kill 4337 sig 15 count 8
> > > Jun 02 10:33:46 ryuki.codenet sanlock[1054]: 2022-06-02 10:33:46 674624 [1054]: s1 kill 3206 sig 15 count 9
> >
> > This means that the host could not access the storage for 80 seconds,
> > and the leases expired. When leases expire, sanlock must kill the
> > processes holding the leases. Here we see that sanlock sent SIGTERM to
> > 3 processes.
> >
> > If these are VMs, they will pause and libvirt will release the lease.
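> >
> > For reference, you can see which lockspaces and leases sanlock
> > currently holds on the host (assuming the sanlock daemon is still
> > responding) with:
> >
> >     sanlock client status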
> >
> > I can check the log deeper next week.
> >
> > Nir
> >
> > > Jun 02 10:33:47 ryuki.codenet kernel: ovirtmgmt: port 4(vnet2) entered disabled state
> > > ---snip---
> > >
> > > That looks like some SELinux failure.
> > >
> > > -Patrick Hibbs
> > >
> > > On Thu, 2022-06-02 at 19:44 +0300, Nir Soffer wrote:
> > > > On Thu, Jun 2, 2022 at 7:14 PM Patrick Hibbs
> > > > <hibbsncc1...@gmail.com>
> > > > wrote:
> > > > >
> > > > > OK, so the data storage domain on a cluster filled up to the point
> > > > > that the OS refused to allocate any more space.
> > > > >
> > > > > This happened because I tried to create a new prealloc'd disk from
> > > > > the Admin WebUI. The disk creation claimed to have completed
> > > > > successfully (I've not tried to use that disk yet), but due to a
> > > > > timeout with the storage domain in question the engine began trying
> > > > > to fence all of the HA VMs. The fencing failed for all of the HA
> > > > > VMs, leaving them in a powered off state, despite all of the HA VMs
> > > > > being up at the time, so no reallocation of the leases should have
> > > > > been necessary.
> > > >
> > > > Leases are not reallocated during fencing; I am not sure why you
> > > > expect this to happen.
> > > >
> > > > > Attempting to restart them manually from the Admin WebUI failed,
> > > > > with the original host they were running on complaining about "no
> > > > > space left on device", and the other hosts claiming that the
> > > > > original host still held the VM lease.
> > > >
> > > > "No space left on device" may be an unfortunate error from sanlock,
> > > > meaning that there is no lockspace. This means the host has trouble
> > > > adding the lockspace, or adding it has not completed yet.
> > > >
> > > > > After cleaning up some old snapshots, the HA VMs would still not
> > > > > boot. Toggling the High Availability setting for each one and
> > > > > allowing the lease to be removed from the storage domain was
> > > > > required to get the VMs to start again.
> > > >
> > > > If you know that the VM is not running, temporarily disabling the
> > > > lease is a good way to work around the issue.
> > > >
> > > > > Re-enabling the High Availability setting thereafter fixed the
> > > > > lease issue. But now some, not all, of the HA VMs are still
> > > > > throwing "no space left on device" errors when attempting to start
> > > > > them. The others are working just fine even with their HA lease
> > > > > enabled.
> > > >
> > > > Do all errors come from the same host(s), or can some VMs not start
> > > > while others can on the same host?
> > > >
> > > > > My questions are:
> > > > >
> > > > > 1. Why does oVirt claim to have a constantly allocated HA VM lease
> > > > > on the storage domain when it's clearly only done while the VM is
> > > > > running?
> > > >
> > > > Leases are allocated when a VM is created. This allocates the lease
> > > > space (1MiB) in the external leases special volume and binds it to
> > > > the VM ID.
> > > >
> > > > When a VM starts, it acquires the lease for its VM ID. If sanlock is
> > > > not connected to the lockspace on this host, this may fail with the
> > > > confusing "No space left on device" error.
> > > >
> > > > > 2. Why does oVirt deallocate the HA VM lease when performing a
> > > > > fencing operation?
> > > >
> > > > It does not. oVirt does not actually "fence" the VM. If the host
> > > > running the VM cannot access storage and update the lease, the host
> > > > loses all leases on that storage. The result is that all VMs holding
> > > > a lease on that storage are paused.
> > > >
> > > > oVirt will try to start the VM on another host, which will try to
> > > > acquire the lease again on the new host. If enough time has passed
> > > > since the original host lost access to storage, the lease can be
> > > > acquired on the new host. If not, this will happen on one of the
> > > > next retries.
> > > >
> > > > If the original host did not lose access to storage and it is still
> > > > updating the lease, you cannot acquire the lease from another host.
> > > > This protects the VM from a split-brain that would corrupt the VM
> > > > disk.
> > > >
> > > > > 3. Why can't oVirt clear the old HA VM lease when the VM is down
> > > > > and the storage pool has space available? (How much space is even
> > > > > needed? The leases section of the storage domain in the Admin WebUI
> > > > > doesn't contain any useful info beyond the fact that a lease should
> > > > > exist for a VM even when it's off.)
> > > >
> > > > Acquiring the lease is possible only if the lease is not held on
> > > > another host.
> > > >
> > > > oVirt does not support acquiring a held lease by killing the process
> > > > holding the lease on another host, but sanlock provides such a
> > > > capability.
> > > >
> > > > > 4. Is there a better way to force start a HA VM when the lease is
> > > > > old and the VM is powered off?
> > > >
> > > > If the original VM is powered off for enough time (2-3 minutes), the
> > > > lease expires and starting the VM on another host should succeed.
> > > >
> > > > > 5. Should I file a bug on the whole HA VM failing to reacquire a
> > > > > lease on a full storage pool?
> > > >
> > > > The external lease volume is not fully allocated. If you use thin
> > > > provisioned storage and there is really no storage space left, it is
> > > > possible that creating a new lease will fail, but starting and
> > > > stopping VMs that have leases should not be affected. But if you
> > > > reach the point where you don't have enough storage space, you have
> > > > much bigger trouble and you should fix it urgently.
> > > >
> > > > Do you really have an issue with available space? What does the
> > > > engine report about the storage domain? What does the underlying
> > > > storage report?
> > > >
> > > > Nir
> > > >
> > >
> >
>
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/EAMXVJCWIMSK6R4HSNJZY6YMF2GLMSFK/
