On Fri, Feb 25, 2022 at 3:31 AM Reid Wahl <[email protected]> wrote:
>
> On Thu, Feb 24, 2022 at 2:28 AM Ulrich Windl
> <[email protected]> wrote:
> >
> > Hi!
> >
> > I just discovered this oddity for a SLES15 SP3 cluster:
> > Feb 24 11:16:17 h16 pacemaker-attrd[7274]: notice: Setting
> > val_net_gw1[h18]: 1000 -> 139000
> >
> > That surprised me, because usually the value is 1000 or 0.
> >
> > Digging a bit further I found:
> > Migration Summary:
> > * Node: h18:
> > * prm_ping_gw1: migration-threshold=1000000 fail-count=1
> > last-failure='Thu Feb 24 11:17:18 2022'
> >
> > Failed Resource Actions:
> > * prm_ping_gw1_monitor_60000 on h18 'error' (1): call=200,
> > status='Error', exitreason='', last-rc-change='2022-02-24 11:17:18 +01:00',
> > queued=0ms, exec=0ms
> >
> > Digging further:
> > Feb 24 11:16:17 h18 kernel: BUG: Bad rss-counter state mm:00000000c620b5fe
> > idx:1 val:17
> > Feb 24 11:16:17 h18 pacemaker-attrd[6946]: notice: Setting
> > val_net_gw1[h18]: 1000 -> 139000
> > Feb 24 11:17:17 h18 kernel: traps: pacemaker-execd[38950] general
> > protection fault ip:7f610e71cbcf sp:7ffff7c25100 error:0 in
> > libc-2.31.so[7f610e63b000+1e6000]
> >
> > (that rss-counter causing series of core dumps seems to be a new "feature"
> > of SLES15 SP3 kernels that is being investigated by support)
> >
> > Somewhat later:
> > Feb 24 11:17:18 h18 pacemaker-attrd[6946]: notice: Setting
> > val_net_gw1[h18]: 139000 -> (unset)
> > (restarted RA)
> > Feb 24 11:17:21 h18 pacemaker-attrd[6946]: notice: Setting
> > val_net_gw1[h18]: (unset) -> 1000
> >
> > Another node:
> > Feb 24 11:16:17 h19 pacemaker-attrd[7435]: notice: Setting
> > val_net_gw1[h18]: 1000 -> 139000
> > Feb 24 11:17:18 h19 pacemaker-attrd[7435]: notice: Setting
> > val_net_gw1[h18]: 139000 -> (unset)
> > Feb 24 11:17:21 h19 pacemaker-attrd[7435]: notice: Setting
> > val_net_gw1[h18]: (unset) -> 1000
> >
> > So it seems the ping RA sets some garbage value when failing. Is that
> > correct?
>
> This is ocf:pacemaker:ping, right? And is use_fping enabled?
>
> Looks like it uses ($active * $multiplier) -- see ping_update(). I'm
> assuming your multiplier is 1000.
>
> $active is set by either fping_check() or ping_check(), depending on
> your configuration. You can see what they're doing here. I'd assume
> $active is getting set to 139 and then is multiplied by 1000 to set
> $score later.
> - https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.5/extra/resources/ping#L220-L277

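To make the arithmetic above concrete, here is a minimal sketch of the (active * multiplier) calculation and of working backwards from the logged value. The helper names ping_score and implied_active are hypothetical and are not functions in the ping agent, and the multiplier of 1000 is only the guess stated above.

```shell
#!/bin/sh
# Hypothetical helper, not part of the ping agent: mirrors the
# (active * multiplier) arithmetic described above.
ping_score() {
    # $1 = number of hosts counted as reachable, $2 = multiplier
    echo $(( $1 * $2 ))
}

# Hypothetical helper: work backwards from a logged attribute value
# to the $active count it would imply.
implied_active() {
    # $1 = logged score, $2 = multiplier
    echo $(( $1 / $2 ))
}

multiplier=1000                       # assumed, per the guess above

ping_score 1 "$multiplier"            # one reachable host
ping_score 0 "$multiplier"            # no reachable host
implied_active 139000 "$multiplier"   # what $active would have had to be
```

With one configured host, this reproduces the usual 1000 and 0, and shows that 139000 would imply active=139. As an observation only: 139 is also 128 + 11, which is the exit status a shell reports for a child killed by SIGSEGV (signal 11 on Linux). Whether such a status could end up in $active has not been traced here.
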
It could also be a side effect of the fault though, since I don't see
anything in fping_check() or ping_check() that's an obvious candidate
for setting active=139 unless you have a massive host list.

> >
> > resource-agents-4.8.0+git30.d0077df0-150300.8.20.1.x86_64
> > pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64
> >
> > Regards,
> > Ulrich

--
Regards,

Reid Wahl (He/Him), RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
