On Sat, 2020-04-11 at 01:03 -0400, Marc Smith wrote:
> On Wed, Apr 1, 2020 at 8:01 PM Ken Gaillot <[email protected]> wrote:
> > On Thu, 2020-03-19 at 13:39 -0400, Marc Smith wrote:
> > > On Mon, Mar 16, 2020 at 1:26 PM Marc Smith <[email protected]> wrote:
> > > > On Thu, Mar 12, 2020 at 10:51 AM Ken Gaillot <[email protected]> wrote:
> > > > > On Wed, 2020-03-11 at 17:24 -0400, Marc Smith wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I'm using Pacemaker 1.1.20 (yes, I know, a bit dated now). I noticed
> > > > >
> > > > > I'd still consider that recent :)
> > > > >
> > > > > > when I modify a resource parameter (e.g., update the value), this
> > > > > > causes the resource itself to restart. And that's fine, but when
> > > > > > this resource is restarted, it doesn't appear to honor the full
> > > > > > set of constraints for that resource.
> > > > > >
> > > > > > I see output like this (right after the resource parameter change):
> > > > > > ...
> > > > > > Mar 11 20:43:25 localhost crmd[1943]: notice: State transition
> > > > > > S_IDLE -> S_POLICY_ENGINE
> > > > > > Mar 11 20:43:25 localhost crmd[1943]: notice: Current ping state:
> > > > > > S_POLICY_ENGINE
> > > > > > Mar 11 20:43:25 localhost pengine[1942]: notice: Clearing failure
> > > > > > of p_bmd_140c58-1 on 140c58-1 because resource parameters have
> > > > > > changed
> > > > > > Mar 11 20:43:25 localhost pengine[1942]: notice: * Restart
> > > > > > p_bmd_140c58-1 ( 140c58-1 ) due to resource definition change
> > > > > > Mar 11 20:43:25 localhost pengine[1942]: notice: * Restart
> > > > > > p_dummy_g_lvm_140c58-1 ( 140c58-1 ) due to required g_md_140c58-1
> > > > > > running
> > > > > > Mar 11 20:43:25 localhost pengine[1942]: notice: * Restart
> > > > > > p_lvm_140c58_vg_01 ( 140c58-1 ) due to required
> > > > > > p_dummy_g_lvm_140c58-1 start
> > > > > > Mar 11 20:43:25 localhost pengine[1942]: notice: Calculated
> > > > > > transition 41, saving inputs in
> > > > > > /var/lib/pacemaker/pengine/pe-input-173.bz2
> > > > > > Mar 11 20:43:25 localhost crmd[1943]: notice: Initiating stop
> > > > > > operation p_lvm_140c58_vg_01_stop_0 on 140c58-1
> > > > > > Mar 11 20:43:25 localhost crmd[1943]: notice: Transition aborted
> > > > > > by deletion of lrm_rsc_op[@id='p_bmd_140c58-1_last_failure_0']:
> > > > > > Resource operation removal
> > > > > > Mar 11 20:43:25 localhost crmd[1943]: notice: Current ping state:
> > > > > > S_TRANSITION_ENGINE
> > > > > > ...
> > > > > >
> > > > > > The stop on 'p_lvm_140c58_vg_01' then times out, because the other
> > > > > > constraint (to stop the service above LVM) is never executed.
> > > > > > I can see from the messages it never even tries to demote the
> > > > > > resource above that.
> > > > > >
> > > > > > Yet, if I use crmsh at the shell and do a restart on that same
> > > > > > resource, it works correctly, and all constraints are honored:
> > > > > >
> > > > > >   crm resource restart p_bmd_140c58-1
> > > > > >
> > > > > > I can certainly provide my full cluster config if needed, but I'm
> > > > > > hoping to keep this email concise for clarity. =)
> > > > > >
> > > > > > I guess my questions are:
> > > > > > 1) Is the difference in restart behavior expected, i.e., are not
> > > > > > all constraints followed when resource parameters change (or some
> > > > > > other restart event that originated internally like this)?
> > > > > > 2) Or perhaps this is a known bug that was already resolved in
> > > > > > newer versions of Pacemaker?
> > > > >
> > > > > No to both. Can you attach that pe-input-173.bz2 file (with any
> > > > > sensitive info removed)?
> > > >
> > > > Thanks; that system got wiped, so I reproduced it on another system
> > > > and I am attaching that pe-input file.
> > > > Log snippet is below for completeness:
> > > >
> > > > Mar 16 17:16:50 localhost crmd[1340]: notice: State transition
> > > > S_IDLE -> S_POLICY_ENGINE
> > > > Mar 16 17:16:50 localhost pengine[1339]: notice: * Restart
> > > > p_bmd_126c4f-1 ( 126c4f-1 ) due to resource definition change
> > > > Mar 16 17:16:50 localhost pengine[1339]: notice: * Restart
> > > > p_dummy_g_lvm_126c4f-1 ( 126c4f-1 ) due to required g_md_126c4f-1
> > > > running
> > > > Mar 16 17:16:50 localhost pengine[1339]: notice: * Restart
> > > > p_lvm_126c4f_vg_01 ( 126c4f-1 ) due to required
> > > > p_dummy_g_lvm_126c4f-1 start
> > > > Mar 16 17:16:50 localhost pengine[1339]: notice: Calculated
> > > > transition 149, saving inputs in
> > > > /var/lib/pacemaker/pengine/pe-input-46.bz2
> > >
> > > Hi Ken,
> > >
> > > Just a friendly bump to see if you had a chance to take a look at this
> > > issue? I appreciate your time and expertise! =)
> > >
> > > --Marc
> >
> > Sorry, I've been slammed lately.
>
> No problem at all, appreciate you taking the time to investigate.
>
> > There does appear to be a scheduler bug. The relevant constraint is (in
> > plain language)
> >
> >     start g_lvm_* then promote ms_alua_*
> >
> > The implicit inverse of that is
> >
> >     demote ms_alua_* then stop g_lvm_*
> >
> > The bug is that ms_alua_* isn't demoted before g_lvm_* is stopped.
> > (Note however that the configuration does not require ms_alua_* to be
> > stopped.)
>
> Anything I can do to debug further? I've worked around this for now in
> my particular use case by simply stopping the ms_alua_* resource before
> modifying the resource parameter; not ideal, but okay for now.
>
> --Marc
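[Editor's note: for readers following along in the archive, here is a minimal sketch of what the ordering constraint described above could look like in the CIB. The resource names are hypothetical placeholders patterned on the thread's g_lvm_* / ms_alua_* naming, not Marc's actual configuration.]

```xml
<!-- Hypothetical CIB fragment; g_lvm_example and ms_alua_example are
     placeholder names. A mandatory "start the group, then promote the
     promotable resource" ordering, as Ken paraphrases it: -->
<rsc_order id="o_lvm_then_alua"
           first="g_lvm_example"  first-action="start"
           then="ms_alua_example" then-action="promote"
           kind="Mandatory"/>
<!-- Pacemaker infers the inverse for teardown: demote ms_alua_example,
     then stop g_lvm_example. The reported bug is that this inferred
     demote is skipped when the restart is triggered by a resource
     parameter change. -->
```

Marc's workaround then amounts to something like `crm resource stop ms_alua_example` before editing the parameter and `crm resource start ms_alua_example` afterward (again with placeholder names), so the buggy implicit demote is never needed.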
I've stripped out as much as possible from your config to come up with a
minimal reproducer. It's surprising, since the result is a simple, common
setup. I've also verified it's not a regression. I suppose no one's
bothered to report it before this, since changing a parameter is rare once
a cluster is in production.

Your workaround is the only thing I can recommend at the moment.

> > > > > > --Marc
> > > > > >
> > > > > > I searched a bit for #2 but I didn't get many (well, any) hits on
> > > > > > other users experiencing this behavior.
> > > > > >
> > > > > > Many thanks in advance.
> > > > > >
> > > > > > --Marc
-- 
Ken Gaillot <[email protected]>
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
