Hi List,
I've been having a strange issue lately.
I have a two-node cluster with some cloned resources on it.
One of my nodes suddenly starts reporting all of its resources as down (some of them are actually running), stops logging and remains in this state forever, while still responding to crm commands.
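To make that concrete, this is the kind of check I mean (p_Samba_Server is one of the affected resources; the smbd check is just an example of verifying the daemon directly):

  crm_mon -1                          # cluster's view: resources shown as Stopped on this node
  crm_resource -r p_Samba_Server -W   # where the cluster thinks the resource is running
  pgrep -l smbd                       # ...while the daemon is in fact still alive on the node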

The curious thing is that restarting corosync/pacemaker doesn't change anything.
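By restarting I mean the usual stop/start of both daemons on the affected node, roughly (init script names as on my systems, they may differ elsewhere):

  service pacemaker stop
  service corosync stop
  service corosync start
  service pacemaker start

The node rejoins the cluster, but it comes back reporting everything down just as before.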

Here are the last lines in the log after the restart:

Feb 17 12:55:17 [609415] CLUSTER-1 crmd: notice: do_started: The local CRM is operational
Feb 17 12:55:17 [609415] CLUSTER-1 crmd: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Feb 17 12:55:17 [609409] CLUSTER-1 cib: info: cib_process_replace: Digest matched on replace from CLUSTER-2: f7cb10ecaff6cfd1661ca7ec779192b3
Feb 17 12:55:17 [609409] CLUSTER-1 cib: info: cib_process_replace: Replaced 0.238.1 with 0.238.40 from CLUSTER-2
Feb 17 12:55:17 [609409] CLUSTER-1 cib: info: cib_replace_notify: Replaced: 0.238.1 -> 0.238.40 from CLUSTER-2
Feb 17 12:55:18 [609415] CLUSTER-1 crmd: info: update_dc: Set DC to CLUSTER-2 (3.0.6)
Feb 17 12:55:19 [609411] CLUSTER-1 stonith-ng: info: stonith_command: Processed register from crmd.609415: OK (0)
Feb 17 12:55:19 [609411] CLUSTER-1 stonith-ng: info: stonith_command: Processed st_notify from crmd.609415: OK (0)
Feb 17 12:55:19 [609411] CLUSTER-1 stonith-ng: info: stonith_command: Processed st_notify from crmd.609415: OK (0)
Feb 17 12:55:19 [609415] CLUSTER-1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='CLUSTER-1']/transient_attributes
Feb 17 12:55:19 [609415] CLUSTER-1 crmd: info: update_attrd: Connecting to attrd... 5 retries remaining
Feb 17 12:55:19 [609415] CLUSTER-1 crmd: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Feb 17 12:55:19 [609413] CLUSTER-1 attrd: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Feb 17 12:55:19 [609409] CLUSTER-1 cib: info: cib_process_replace: Digest matched on replace from CLUSTER-2: f7cb10ecaff6cfd1661ca7ec779192b3
Feb 17 12:55:19 [609409] CLUSTER-1 cib: info: cib_process_replace: Replaced 0.238.40 with 0.238.40 from CLUSTER-2
Feb 17 12:55:21 [609413] CLUSTER-1 attrd: warning: attrd_cib_callback: Update shutdown=(null) failed: No such device or address
Feb 17 12:55:22 [609413] CLUSTER-1 attrd: warning: attrd_cib_callback: Update terminate=(null) failed: No such device or address
Feb 17 12:55:25 [609413] CLUSTER-1 attrd: warning: attrd_cib_callback: Update pingd=(null) failed: No such device or address
Feb 17 12:55:26 [609413] CLUSTER-1 attrd: warning: attrd_cib_callback: Update fail-count-p_Samba_Server=(null) failed: No such device or address
Feb 17 12:55:26 [609413] CLUSTER-1 attrd: warning: attrd_cib_callback: Update master-p_Device_drbddrv1=(null) failed: No such device or address
Feb 17 12:55:27 [609413] CLUSTER-1 attrd: warning: attrd_cib_callback: Update last-failure-p_Samba_Server=(null) failed: No such device or address
Feb 17 12:55:27 [609413] CLUSTER-1 attrd: warning: attrd_cib_callback: Update probe_complete=(null) failed: No such device or address

After that the logging on the problematic node stops.
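If I read the attrd warnings right, the updates to CLUSTER-1's transient attributes in the CIB status section are being rejected ("No such device or address"), so my next step would be to check whether a node_state entry for CLUSTER-1 exists at all and whether corosync still sees both members, roughly:

  cibadmin -Q -o status | grep node_state   # is there a node_state entry for CLUSTER-1?
  corosync-quorumtool -s                    # corosync membership/quorum as seen from this node
  crm_mon -1                                # what the DC (CLUSTER-2) currently reports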

Corosync is v2.1.0.26; Pacemaker is v1.1.8.

Regards,
Klecho
