>>> Ken Gaillot <[email protected]> wrote on 10.10.2019 at 21:19 in message
<[email protected]>:
> On Thu, 2019-10-10 at 17:22 +0200, Lentes, Bernd wrote:
>> Hi,
>>
>> I have a two-node cluster running on SLES 12 SP4.
>> I did some testing on it.
>> I put one node into standby (ha-idg-2); the other (ha-idg-1) got fenced a
>> few minutes later because I made a mistake.
>> ha-idg-2 was DC. ha-idg-1 made a fresh boot and I started
>> corosync/pacemaker on it.
>> It seems ha-idg-1 didn't find the DC after starting the cluster and some
>> seconds later elected itself DC,
>> afterwards fenced ha-idg-2.
>
> For some reason, corosync on the two nodes was not able to communicate
> with each other.
>
> This type of situation is why corosync's two_node option normally
> includes wait_for_all.
>
>>
>> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [MAIN ] Corosync Cluster Engine ('2.3.6'): started and ready to provide service.
>> Oct 09 18:04:43 [9550] ha-idg-1 corosync info [MAIN ] Corosync built-in features: debug testagents augeas systemd pie relro bindnow
>> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] Initializing transport (UDP/IP Multicast).
>> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
>> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] The network interface [192.168.100.10] is now up.
>>
>> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped (20000ms)
>> Oct 09 18:05:06 [9565] ha-idg-1 crmd: warning: do_log: Input I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
>> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_state_transition: State transition S_PENDING -> S_ELECTION | input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped
>> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: election_check: election-DC won by local node
>> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_log: Input I_ELECTION_DC received in state S_ELECTION from election_win_cb
>> Oct 09 18:05:06 [9565] ha-idg-1 crmd: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=election_win_cb
>> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_te_control: Registering TE UUID: f302e1d4-a1aa-4a3e-b9dd-71bd17047f82
>> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: set_graph_functions: Setting custom graph functions
>> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_dc_takeover: Taking over DC status for this partition
>>
>> Oct 09 18:05:07 [9564] ha-idg-1 pengine: warning: stage6: Scheduling Node ha-idg-2 for STONITH
>> Oct 09 18:05:07 [9564] ha-idg-1 pengine: notice: LogNodeActions: * Fence (Off) ha-idg-2 'node is unclean'
>>
>> Is my understanding correct?
>
> Yes
>
>> In the log of ha-idg-2 I don't find anything for this period:
>>
>> Oct 09 17:58:46 [12504] ha-idg-2 stonith-ng: info: cib_device_update: Device fence_ilo_ha-idg-2 has been disabled on ha-idg-2: score=-10000
>> Oct 09 17:58:51 [12503] ha-idg-2 cib: info: cib_process_ping: Reporting our current digest to ha-idg-2: 59c4cfb14defeafbeb3417e222242cd9 for 2.9506.36 (0x242b110 0)
>>
>> Oct 09 18:00:42 [12508] ha-idg-2 crmd: info: throttle_send_command: New throttle mode: 0001 (was 0000)
>> Oct 09 18:01:12 [12508] ha-idg-2 crmd: info: throttle_check_thresholds: Moderate CPU load detected: 32.220001
>> Oct 09 18:01:12 [12508] ha-idg-2 crmd: info: throttle_send_command: New throttle mode: 0010 (was 0001)
>> Oct 09 18:01:42 [12508] ha-idg-2 crmd: info: throttle_send_command: New throttle mode: 0001 (was 0010)
>> Oct 09 18:02:42 [12508] ha-idg-2 crmd: info: throttle_send_command: New throttle mode: 0000 (was 0001)
>>
>> ha-idg-2 was fenced, and after a reboot I started corosync/pacemaker on
>> it again:
>>
>> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [MAIN ] Corosync Cluster Engine ('2.3.6'): started and ready to provide service.
>> Oct 09 18:29:05 [11795] ha-idg-2 corosync info [MAIN ] Corosync built-in features: debug testagents augeas systemd pie relro bindnow
>> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [TOTEM ] Initializing transport (UDP/IP Multicast).
>> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
>>
>> What is the meaning of the lines with the throttle?
>
> Those messages could definitely be improved. The particular mode values
> indicate no significant CPU load (0000), low load (0001), medium
> (0010), high (0100), or extreme (1000).

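In case it helps when reading such logs, here is a minimal sketch that turns
the throttle lines into the labels Ken lists above. It assumes the message
format seen in the quoted log ("New throttle mode: XXXX (was YYYY)"). The names
THROTTLE_LABELS and describe_throttle are my own and not anything taken from
Pacemaker, and the table is keyed on the literal four-character string as it
appears in the log, without interpreting it as a number.

```python
import re

# Labels as given in Ken's reply; keys are the literal strings from the log.
THROTTLE_LABELS = {
    "0000": "none",
    "0001": "low",
    "0010": "medium",
    "0100": "high",
    "1000": "extreme",
}

# Assumed message format, copied from the quoted crmd log lines.
MODE_RE = re.compile(r"New throttle mode: (\d{4}) \(was (\d{4})\)")


def describe_throttle(line):
    """Return (new_label, old_label) for a throttle log line, or None."""
    match = MODE_RE.search(line)
    if match is None:
        return None
    new_mode, old_mode = match.groups()
    return (
        THROTTLE_LABELS.get(new_mode, "unknown"),
        THROTTLE_LABELS.get(old_mode, "unknown"),
    )
```

Fed the 18:01:12 line from ha-idg-2, describe_throttle() returns
('medium', 'low'), i.e. the reported load went from low to medium. Lines
without a throttle message return None.
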
Funny: save a few bytes here, but waste many elsewhere ;-)

>
> I wouldn't expect a CPU spike to lock up corosync for very long, but it
> could be related somehow.
>
>>
>> Thanks.
>>
>>
>> Bernd
> --
> Ken Gaillot <[email protected]>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
