On 4/21/21 12:48 AM, Klaus Wenninger wrote:
> Just to better understand the issue ...
> Does the first resource implement storage that is being used
> by the resource that is being migrated/moved?
> Or is it just the combination of 2 parallel moves that is
> overcommitting storage or network?
> Is it assured that there are no load-scenarios inside these
> resources that create the same issues as if you migrate/move
> them?
>
> Klaus

Thanks for the help, Klaus. I'll spell it out more clearly.

I'm using a resource group that sets up a failover IP address, then mounts a ZFS dataset (which exports a configuration directory over NFS), then runs a custom resource called ZFSiSCSI that exports all of the virtual machine disks as iSCSI. A rough configuration sketch follows the status listing below.

Like this:

  * Resource Group: IP-ZFS-iSCSI:
    * fence-datastore    (stonith:fence_scsi):     Started node1
    * failover-ip    (ocf::heartbeat:IPaddr):     Started node1
    * zfs-datastore    (ocf::heartbeat:ZFS):     Started node1
    * ZFSiSCSI    (ocf::heartbeat:ZFSiSCSI):     Started node1
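
For completeness, a rough crmsh sketch of how a group like this could be defined. I'm omitting the fence_scsi device; the IP and pool values are placeholders, and ZFSiSCSI is our custom agent, so its real parameters aren't shown:

    # Rough sketch only -- values are placeholders, not our real config.
    primitive failover-ip ocf:heartbeat:IPaddr \
        params ip=192.0.2.10 \
        op monitor interval=10s
    primitive zfs-datastore ocf:heartbeat:ZFS \
        params pool=datastore \
        op monitor interval=30s
    primitive ZFSiSCSI ocf:heartbeat:ZFSiSCSI \
        op monitor interval=30s
    # The group starts/stops its members in order and moves them as one unit.
    group IP-ZFS-iSCSI failover-ip zfs-datastore ZFSiSCSI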

Then I create a virtual machine with

primitive vm-testvm VirtualDomain \
    params config="/nfs/vm/testvm/testvm.xml" \
    meta allow-migrate=true \
    op monitor timeout=30 interval=10

This works fine because the ZFS storage can be mounted and exported on either node1 or node2, each of which then presents an iSCSI target for every VM, bound to the shared IP address. I can move the storage to either node; there is a brief pause in storage access, but everything keeps working because the move completes faster than the iSCSI timeout. I can also live-migrate the VM to either node, because as soon as it starts on the target node it can reach its iSCSI storage, regardless of whether that storage is local to the node or not.
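
For reference, the iSCSI timeout I'm leaning on is the initiator-side replacement timeout; with open-iscsi it lives in iscsid.conf (the 120 seconds shown here is the upstream default, not necessarily what we run):

    # /etc/iscsi/iscsid.conf (open-iscsi) -- illustrative; 120s is the default.
    # How long queued I/O is held while the target is unreachable before it
    # is failed back up to the block layer.
    node.session.timeo.replacement_timeout = 120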

The problem is the monitor action of VirtualDomain. The /usr/lib/ocf/resource.d/heartbeat/VirtualDomain script checks whether /nfs/vm/testvm/testvm.xml is readable with these lines:

        if [ ! -r $OCF_RESKEY_config ]; then
                if ocf_is_probe; then
                        ocf_log info "Configuration file $OCF_RESKEY_config not readable during probe."

That test makes the shell stat the config file. If we are in the middle of an IP-ZFS-iSCSI move, the stat fails, VirtualDomain decides the VM is dead, and it hard-resets it.
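
To make the failure mode concrete, here is a stripped-down illustration of the race. This is not the RA code, just the shape of the problem:

    # Not VirtualDomain itself -- just the shape of the problem.
    config=/nfs/vm/testvm/testvm.xml
    # "test -r" stats the path; while the NFS export is moving between nodes
    # the stat fails immediately instead of blocking and retrying.
    if [ ! -r "$config" ]; then
        echo "config unreadable while storage is moving" >&2
        exit 1   # the RA turns this into a monitor failure, so the VM gets recovered
    fi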

If I set resource stickiness to 100 it becomes a race condition: most of the time the storage layer migrates without VirtualDomain noticing. If stickiness is not set, though, moving a resource causes the cluster to rebalance, and then the VM fails every time: validating the config file is one of the first things VirtualDomain does when migrating the VM, that validation happens at the same moment as the IP-ZFS-iSCSI move, and the config file is unreachable for about 5 seconds.
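
(For reference, by stickiness I mean the cluster-wide default; with crmsh that is roughly:)

    # crmsh; 100 is the value mentioned above
    crm configure rsc_defaults resource-stickiness=100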

I'm not sure how to fix this. The nodes don't have any local storage outside the ZFS pool; otherwise I'd just create a local config directory on each node and keep them in sync with GlusterFS.

I suppose the next step is to see whether NFS has some sort of retry mode, so that stat()ing the config file blocks until a timeout instead of failing right away. That would certainly fix my issue, since that's how the iSCSI side already behaves: retry until timeout. Another option is to rework VirtualDomain, because stat()ing a config file isn't really a good test of whether the domain is working. It makes more sense to ask libvirt via a virsh call whether the domain is running, and only care about the config file when actually starting the domain.
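
On the NFS idea: if the export is currently mounted soft, a hard mount might already give the blocking-and-retrying behaviour I'm after. Something like this in fstab (server, path and mountpoint are placeholders, and hard is normally the default anyway):

    # /etc/fstab sketch -- server path and mountpoint are placeholders.
    # A "hard" mount blocks and retries I/O while the server is unreachable
    # instead of returning an error to the caller.
    nfs-server:/datastore/nfs  /nfs  nfs  hard,timeo=600,retrans=2  0  0

And on the VirtualDomain idea, something like the rough sketch below is what I have in mind for the monitor path. This is not the existing RA logic; "testvm" stands in for the real libvirt domain name and the exit codes are just the standard OCF values:

    # Rough monitor sketch: trust libvirt for liveness, only require the
    # config file when we would actually need it to (re)define the domain.
    state=$(virsh domstate testvm 2>/dev/null)
    case "$state" in
        running|paused)
            exit 0        # OCF_SUCCESS: domain is alive, config not needed
            ;;
        "shut off")
            exit 7        # OCF_NOT_RUNNING
            ;;
        *)
            # Only here does the config file matter (e.g. to redefine/start);
            # a missing file at this point is a real error.
            [ -r /nfs/vm/testvm/testvm.xml ] || exit 5   # OCF_ERR_INSTALLED
            exit 1        # OCF_ERR_GENERIC
            ;;
    esac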

Ideas welcome!!!!

Matt
