>>> damiano giuliani <damianogiulian...@gmail.com> wrote on 23.07.2021 at 12:52
>>> in message <CAG=zynm9frnaosre92mjzl_bxyun37ffcuvqxe0qpurg0ms...@mail.gmail.com>:
> Hi guys, thanks for the support.
> The query time isn't the problem; it is known to take its time. The network
> is 10 Gb/s bonding, practically impossible to saturate with queries :=).
> The servers are heavily over-provisioned: at full database load about 20% of
> the resources are in use.
> Checking the logs again, what is still not clear to me is the cause of the
> loss of quorum and the subsequent fencing of the node.
> There is no information in the logs (not even in the iDRAC / motherboard
> event logs).
>
> The only clear log entries are:
> [228684] ltaoperdbs03 corosyncnotice [TOTEM ] A processor failed, forming new configuration.

Hi!

I wonder: Would the corosync blackbox (COROSYNC-BLACKBOX(8)) help? As an
alternative you could capture the TOTEM packets in some rotating files, trying
to find out what was going on. As it seems now, the issue is that a remote
node cannot "be seen".
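Roughly like this (untested, from memory; the interface name bond0, the totem
port 5405 and the file sizes are just guesses, adjust them to your setup):

    # dump the corosync flight recorder after the next incident
    corosync-blackbox > /tmp/fdata.$(date +%Y%m%d-%H%M).txt

    # or keep a ring buffer of the totem traffic running permanently
    # (10 files of about 100 MB each, oldest gets overwritten)
    tcpdump -i bond0 -s 0 -C 100 -W 10 -w /var/log/totem.pcap udp port 5405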
Regards,
Ulrich

> [228684] ltaoperdbs03 corosyncnotice [TOTEM ] A new membership (172.18.2.12:227) was formed. Members left: 1
> [228684] ltaoperdbs03 corosyncnotice [TOTEM ] Failed to receive the leave message. failed: 1
> [228684] ltaoperdbs03 corosyncwarning [CPG ] downlist left_list: 1 received
> [228684] ltaoperdbs03 corosyncwarning [CPG ] downlist left_list: 1 received
> Jul 13 00:40:37 [228695] ltaoperdbs03 cib: info: pcmk_cpg_membership: Group cib event 3: ltaoperdbs02 (node 1 pid 6136) left via cluster exit
> Jul 13 00:40:37 [228695] ltaoperdbs03 cib: info: crm_update_peer_proc: pcmk_cpg_membership: Node ltaoperdbs02[1] - corosync-cpg is now offline
> Jul 13 00:40:37 [228700] ltaoperdbs03 crmd: info: pcmk_cpg_membership: Group crmd event 3: ltaoperdbs02 (node 1 pid 6141) left via cluster exit
> Jul 13 00:40:37 [228695] ltaoperdbs03 cib: notice: crm_update_peer_state_iter: Node ltaoperdbs02 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: warning: pe_fence_node: Cluster node ltaoperdbs02 will be fenced: peer is no longer part of the cluster
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: warning: determine_online_status: Node ltaoperdbs02 is unclean
>
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: notice: LogNodeActions:  * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: notice: LogAction:  * Promote pgsqld:0 ( Slave -> Master ltaoperdbs03 )
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: info: LogActions: Leave pgsqld:1 (Slave ltaoperdbs04)
>
> So the cluster worked flawlessly, as expected: as soon as ltaoperdbs02 became
> "unreachable", it formed a new quorum, fenced the lost node and promoted the
> new master.
>
> What I can't find out is WHY it happened.
> There is no useful information in the system logs, nor in the iDRAC
> motherboard logs.
>
> Is there a way to improve or configure logging for a fenced / failed node?
>
> Thanks
>
> Damiano
>
> On Thu, 22 Jul 2021 at 15:06, Jehan-Guillaume de Rorthais <j...@dalibo.com>
> wrote:
>
>> Hi,
>>
>> On Wed, 14 Jul 2021 07:58:14 +0200
>> "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> wrote:
>> [...]
>> > Could it be that a command saturated the network?
>> > Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936
>> > UTC [172262] LOG: duration: 660.329 ms execute <unnamed>: SELECT
>> > xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011
>> > ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON
>> > f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE
>> > xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND
>> > o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT
>> > 7265 ;
>>
>> I doubt such a query could saturate the network. The query time itself isn't
>> proportional to the result set size.
>>
>> Moreover, there are only three fields per row and, judging by their names, I
>> doubt the row size is really big.
>>
>> Plus, even supposing the result set is that big, chances are that the
>> frontend will not be able to cope with it as fast as the network, unless the
>> frontend does nothing really fancy with the dataset. So the frontend itself
>> might saturate before the network, giving some break to the latter.
>>
>> However, if this query time is unusual, that might indicate some pressure on
>> the server by some other means (CPU? MEM? IO?). Detailed metrics would help.
>>
>> Regards,

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/