On 11/10/2016 09:47 AM, Toni Tschampke wrote:
>> Did your upgrade documentation describe how to update the corosync
>> configuration, and did that go well? crmd may be unable to function due
>> to lack of quorum information.
>
> Thanks for this tip, corosync quorum configuration was the cause.
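For the archives: the piece that is new with corosync 2.x is the quorum section in corosync.conf, since the old pacemaker quorum plugin no longer exists. A minimal sketch for a three-node cluster like this one (values are illustrative, adjust to your own nodelist):

    quorum {
        provider: corosync_votequorum
        # expected_votes: 3   # only needed if corosync.conf has no nodelist to derive votes from
    }

After restarting corosync, corosync-quorumtool -s should report the cluster as quorate once a majority of nodes is up, and crmd should be able to sign on again.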
> As we changed validate-with as well as the feature set manually in the
> cib, is there a need for issuing the cibadmin --upgrade --force
> command, or is this command just for changing the schema?

Guess not, as this would just do automatically (to the latest schema version) what you've done manually already.

> On 08.11.2016 at 22:51, Ken Gaillot wrote:
>> On 11/07/2016 09:08 AM, Toni Tschampke wrote:
>>> We managed to change the validate-with option via a workaround (cibadmin
>>> export & replace), as setting the value with cibadmin --modify doesn't
>>> write the changes to disk.
>>>
>>> After experimenting with various schemas (the xml is correctly interpreted
>>> by crmsh) we are still not able to communicate with the local crmd.
>>>
>>> Can someone please help determine why the local crmd is not
>>> responding (we disabled our other nodes to rule out possible corosync
>>> related issues) and runs into errors/timeouts when issuing crmsh or
>>> cibadmin related commands?
>>
>> It occurs to me that wheezy used corosync 1. There were major changes
>> from corosync 1 to 2 ... 1 relied on a "plugin" to provide quorum for
>> pacemaker, whereas 2 has quorum built-in.
>>
>> Did your upgrade documentation describe how to update the corosync
>> configuration, and did that go well? crmd may be unable to function due
>> to lack of quorum information.
>>
>>> Examples of local commands that do not work:
>>>
>>> timeout when running cibadmin (strace attached):
>>>> cibadmin --upgrade --force
>>>> Call cib_upgrade failed (-62): Timer expired
>>>
>>> error when running a crm resource cleanup:
>>>> crm resource cleanup $vm
>>>> Error signing on to the CRMd service
>>>> Error performing operation: Transport endpoint is not connected
>>>
>>> I attached the strace log from running cib_upgrade, does this help to
>>> find the cause of the timeout issue?
>>>
>>> Here is the corosync dump when locally starting pacemaker:
>>>
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [MAIN ] main.c:1256 Corosync Cluster Engine ('2.3.6'): started and ready to provide service.
>>>> Nov 07 16:01:59 [24339] nebel1 corosync info [MAIN ] main.c:1257 Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices snmp pie relro bindnow
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] totemnet.c:248 Initializing transport (UDP/IP Multicast).
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto: none hash: none
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] totemnet.c:248 Initializing transport (UDP/IP Multicast).
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto: none hash: none
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] totemudp.c:671 The network interface [10.112.0.1] is now up.
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 >>>> Service engine loaded: corosync configuration map access [0] >>>> Nov 07 16:01:59 [24339] nebel1 corosync info [QB ] >>>> ipc_setup.c:536 server name: cmap >>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 >>>> Service engine loaded: corosync configuration service [1] >>>> Nov 07 16:01:59 [24339] nebel1 corosync info [QB ] >>>> ipc_setup.c:536 server name: cfg >>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 >>>> Service engine loaded: corosync cluster closed process group service >>>> v1.01 [2] >>>> Nov 07 16:01:59 [24339] nebel1 corosync info [QB ] >>>> ipc_setup.c:536 server name: cpg >>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 >>>> Service engine loaded: corosync profile loading service [4] >>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 >>>> Service engine loaded: corosync resource monitoring service [6] >>>> Nov 07 16:01:59 [24339] nebel1 corosync info [WD ] wd.c:669 >>>> Watchdog /dev/watchdog is now been tickled by corosync. >>>> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD ] wd.c:625 >>>> Could not change the Watchdog timeout from 10 to 6 seconds >>>> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD ] wd.c:464 >>>> resource load_15min missing a recovery key. >>>> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD ] wd.c:464 >>>> resource memory_used missing a recovery key. >>>> Nov 07 16:01:59 [24339] nebel1 corosync info [WD ] wd.c:581 no >>>> resources configured. >>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 >>>> Service engine loaded: corosync watchdog service [7] >>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 >>>> Service engine loaded: corosync cluster quorum service v0.1 [3] >>>> Nov 07 16:01:59 [24339] nebel1 corosync info [QB ] >>>> ipc_setup.c:536 server name: quorum >>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] >>>> totemudp.c:671 The network interface [10.110.1.1] is now up. >>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] >>>> totemsrp.c:2095 A new membership (10.112.0.1:348) was formed. Members >>>> joined: 1 >>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [MAIN ] main.c:310 >>>> Completed service synchronization, ready to provide service. 
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: notice: main: >>>> Starting Pacemaker 1.1.15 | build=e174ec8 features: generated-manpages >>>> agent-manpages ascii-docs publican-docs ncurses libqb-logging >>>> libqb-ipc lha-fencing upstart systemd nagios corosync-native >>>> atomic-attrd snmp libesmtp acls >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: main: >>>> Maximum core file size is: 18446744073709551615 >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> qb_ipcs_us_publish: server name: pacemakerd >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> corosync_node_name: Unable to get node name for nodeid 1 >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: notice: >>>> get_node_name: Could not obtain a node name for corosync nodeid 1 >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> crm_get_peer: Created entry >>>> 283a5061-34c2-4b81-bff9-738533f22277/0x7f8a151931a0 for node (null)/1 >>>> (1 total) >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> crm_get_peer: Node 1 has uuid 1 >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> crm_update_peer_proc: cluster_connect_cpg: Node (null)[1] - >>>> corosync-cpg is now online >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: error: >>>> cluster_connect_quorum: Corosync quorum is not configured >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> corosync_node_name: Unable to get node name for nodeid 1 >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: notice: >>>> get_node_name: Defaulting to uname -n for the local corosync node >>>> name >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> crm_get_peer: Node 1 is now known as nebel1 >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> start_child: Using uid=108 and group=114 for process cib >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> start_child: Forked child 24342 for process cib >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> start_child: Forked child 24343 for process stonith-ng >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> start_child: Forked child 24344 for process lrmd >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> start_child: Using uid=108 and group=114 for process attrd >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> start_child: Forked child 24345 for process attrd >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> start_child: Using uid=108 and group=114 for process pengine >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> start_child: Forked child 24346 for process pengine >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> start_child: Using uid=108 and group=114 for process crmd >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> start_child: Forked child 24347 for process crmd >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: main: >>>> Starting mainloop >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> pcmk_cpg_membership: Node 1 joined group pacemakerd >>>> (counter=0.0) >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> pcmk_cpg_membership: Node 1 still member of group pacemakerd >>>> (peer=nebel1, counter=0.0) >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node >>>> Nov 07 16:01:59 
[24341] nebel1 pacemakerd: info: >>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node >>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: >>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node >>>> Nov 07 16:01:59 [24342] nebel1 cib: info: >>>> crm_log_init: Changed active directory to >>>> /var/lib/pacemaker/cores >>>> Nov 07 16:01:59 [24342] nebel1 cib: notice: main: Using >>>> legacy config location: /var/lib/heartbeat/crm >>>> Nov 07 16:01:59 [24342] nebel1 cib: info: >>>> get_cluster_type: Verifying cluster type: 'corosync' >>>> Nov 07 16:01:59 [24342] nebel1 cib: info: >>>> get_cluster_type: Assuming an active 'corosync' cluster >>>> Nov 07 16:01:59 [24342] nebel1 cib: info: >>>> retrieveCib: Reading cluster configuration file >>>> /var/lib/heartbeat/crm/cib.xml (digest: >>>> /var/lib/heartbeat/crm/cib.xml.sig) >>>> Nov 07 16:01:59 [24344] nebel1 lrmd: info: >>>> crm_log_init: Changed active directory to >>>> /var/lib/pacemaker/cores >>>> Nov 07 16:01:59 [24344] nebel1 lrmd: info: >>>> qb_ipcs_us_publish: server name: lrmd >>>> Nov 07 16:01:59 [24344] nebel1 lrmd: info: main: >>>> Starting >>>> Nov 07 16:01:59 [24346] nebel1 pengine: info: >>>> crm_log_init: Changed active directory to >>>> /var/lib/pacemaker/cores >>>> Nov 07 16:01:59 [24346] nebel1 pengine: info: >>>> qb_ipcs_us_publish: server name: pengine >>>> Nov 07 16:01:59 [24346] nebel1 pengine: info: main: >>>> Starting pengine >>>> Nov 07 16:01:59 [24345] nebel1 attrd: info: >>>> crm_log_init: Changed active directory to >>>> /var/lib/pacemaker/cores >>>> Nov 07 16:01:59 [24345] nebel1 attrd: info: main: >>>> Starting up >>>> Nov 07 16:01:59 [24345] nebel1 attrd: info: >>>> get_cluster_type: Verifying cluster type: 'corosync' >>>> Nov 07 16:01:59 [24345] nebel1 attrd: info: >>>> get_cluster_type: Assuming an active 'corosync' cluster >>>> Nov 07 16:01:59 [24345] nebel1 attrd: notice: >>>> crm_cluster_connect: Connecting to cluster infrastructure: >>>> corosync >>>> Nov 07 16:01:59 [24347] nebel1 crmd: info: >>>> crm_log_init: Changed active directory to >>>> /var/lib/pacemaker/cores >>>> Nov 07 16:01:59 [24347] nebel1 crmd: info: main: CRM >>>> Git Version: 1.1.15 (e174ec8) >>>> Nov 07 16:01:59 [24343] nebel1 stonith-ng: info: >>>> crm_log_init: Changed active directory to >>>> /var/lib/pacemaker/cores >>>> Nov 07 16:01:59 [24343] nebel1 stonith-ng: info: >>>> get_cluster_type: Verifying cluster type: 'corosync' >>>> Nov 07 16:01:59 [24343] nebel1 stonith-ng: info: >>>> get_cluster_type: Assuming an active 'corosync' cluster >>>> Nov 07 16:01:59 [24343] nebel1 stonith-ng: notice: >>>> crm_cluster_connect: Connecting to cluster infrastructure: >>>> corosync >>>> Nov 07 16:01:59 [24347] nebel1 crmd: info: do_log: Input >>>> I_STARTUP received in state S_STARTING from crmd_init >>>> Nov 07 16:01:59 [24347] nebel1 crmd: info: >>>> get_cluster_type: Verifying cluster type: 'corosync' >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> corosync_node_name: Unable to get node name for nodeid 1 >>>> Nov 07 16:02:00 [24343] nebel1 stonith-ng: info: >>>> corosync_node_name: Unable to get node name for nodeid 1 >>>> Nov 07 16:02:00 [24342] nebel1 cib: notice: >>>> get_node_name: Could not obtain a node name for corosync nodeid 1 >>>> Nov 07 16:02:00 
[24343] nebel1 stonith-ng: notice: >>>> get_node_name: Defaulting to uname -n for the local corosync node >>>> name >>>> Nov 07 16:02:00 [24343] nebel1 stonith-ng: info: >>>> crm_get_peer: Node 1 is now known as nebel1 >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> crm_get_peer: Created entry >>>> f5df58e3-3848-440c-8f6b-d572f8fa9b9c/0x7f0ce1744570 for node (null)/1 >>>> (1 total) >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> crm_get_peer: Node 1 has uuid 1 >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> crm_update_peer_proc: cluster_connect_cpg: Node (null)[1] - >>>> corosync-cpg is now online >>>> Nov 07 16:02:00 [24342] nebel1 cib: notice: >>>> crm_update_peer_state_iter: Node (null) state is now member | >>>> nodeid=1 previous=unknown source=crm_update_peer_proc >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> init_cs_connection_once: Connection to 'corosync': established >>>> Nov 07 16:02:00 [24345] nebel1 attrd: info: main: >>>> Cluster connection active >>>> Nov 07 16:02:00 [24345] nebel1 attrd: info: >>>> qb_ipcs_us_publish: server name: attrd >>>> Nov 07 16:02:00 [24345] nebel1 attrd: info: main: >>>> Accepting attribute updates >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> corosync_node_name: Unable to get node name for nodeid 1 >>>> Nov 07 16:02:00 [24342] nebel1 cib: notice: >>>> get_node_name: Defaulting to uname -n for the local corosync node >>>> name >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> crm_get_peer: Node 1 is now known as nebel1 >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> qb_ipcs_us_publish: server name: cib_ro >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> qb_ipcs_us_publish: server name: cib_rw >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> qb_ipcs_us_publish: server name: cib_shm >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: cib_init: >>>> Starting cib mainloop >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> pcmk_cpg_membership: Node 1 joined group cib (counter=0.0) >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> pcmk_cpg_membership: Node 1 still member of group cib >>>> (peer=nebel1, counter=0.0) >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> cib_file_backup: Archived previous version as >>>> /var/lib/heartbeat/crm/cib-72.raw >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> cib_file_write_with_digest: Wrote version 0.8464.0 of the CIB >>>> to disk (digest: 5201c56641a95e5117df4184587c3e93) >>>> Nov 07 16:02:00 [24342] nebel1 cib: info: >>>> cib_file_write_with_digest: Reading cluster configuration file >>>> /var/lib/heartbeat/crm/cib.naRhNz (digest: >>>> /var/lib/heartbeat/crm/cib.hLaVCH) >>>> Nov 07 16:02:00 [24347] nebel1 crmd: info: >>>> do_cib_control: CIB connection established >>>> Nov 07 16:02:00 [24347] nebel1 crmd: notice: >>>> crm_cluster_connect: Connecting to cluster infrastructure: >>>> corosync >>>> Nov 07 16:02:00 [24347] nebel1 crmd: info: >>>> corosync_node_name: Unable to get node name for nodeid 1 >>>> Nov 07 16:02:00 [24347] nebel1 crmd: notice: >>>> get_node_name: Could not obtain a node name for corosync nodeid 1 >>>> Nov 07 16:02:00 [24347] nebel1 crmd: info: >>>> crm_get_peer: Created entry >>>> 43a3b98f-d81d-4cc7-b46e-4512f24db371/0x7f798ff40040 for node (null)/1 >>>> (1 total) >>>> Nov 07 16:02:00 [24347] nebel1 crmd: info: >>>> crm_get_peer: Node 1 has uuid 1 >>>> Nov 07 16:02:00 [24347] nebel1 crmd: info: >>>> crm_update_peer_proc: cluster_connect_cpg: Node (null)[1] - >>>> corosync-cpg is now online >>>> Nov 07 16:02:00 [24347] nebel1 crmd: info: >>>> 
init_cs_connection_once: Connection to 'corosync': established >>>> Nov 07 16:02:00 [24347] nebel1 crmd: info: >>>> corosync_node_name: Unable to get node name for nodeid 1 >>>> Nov 07 16:02:00 [24347] nebel1 crmd: notice: >>>> get_node_name: Defaulting to uname -n for the local corosync node >>>> name >>>> Nov 07 16:02:00 [24347] nebel1 crmd: info: >>>> crm_get_peer: Node 1 is now known as nebel1 >>>> Nov 07 16:02:00 [24347] nebel1 crmd: info: >>>> peer_update_callback: nebel1 is now in unknown state >>>> Nov 07 16:02:00 [24347] nebel1 crmd: error: >>>> cluster_connect_quorum: Corosync quorum is not configured >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> corosync_node_name: Unable to get node name for nodeid 1 >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> corosync_node_name: Unable to get node name for nodeid 2 >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> corosync_node_name: Unable to get node name for nodeid 2 >>>> Nov 07 16:02:01 [24347] nebel1 crmd: notice: >>>> get_node_name: Could not obtain a node name for corosync nodeid 2 >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> crm_get_peer: Created entry >>>> c790c642-6666-4022-bba9-f700e4773b03/0x7f79901428e0 for node (null)/2 >>>> (2 total) >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> crm_get_peer: Node 2 has uuid 2 >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> corosync_node_name: Unable to get node name for nodeid 3 >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> corosync_node_name: Unable to get node name for nodeid 3 >>>> Nov 07 16:02:01 [24347] nebel1 crmd: notice: >>>> get_node_name: Could not obtain a node name for corosync nodeid 3 >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> crm_get_peer: Created entry >>>> 928f8124-4d29-4285-99de-50038d3c3b7e/0x7f7990142a20 for node (null)/3 >>>> (3 total) >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> crm_get_peer: Node 3 has uuid 3 >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> do_ha_control: Connected to the cluster >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> lrmd_ipc_connect: Connecting to lrmd >>>> Nov 07 16:02:01 [24342] nebel1 cib: info: >>>> cib_process_request: Forwarding cib_modify operation for section >>>> nodes to all (origin=local/crmd/3) >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> do_lrm_control: LRM connection established >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> do_started: Delaying start, no membership data >>>> (0000000000100000) >>>> Nov 07 16:02:01 [24342] nebel1 cib: info: >>>> corosync_node_name: Unable to get node name for nodeid 1 >>>> Nov 07 16:02:01 [24342] nebel1 cib: notice: >>>> get_node_name: Defaulting to uname -n for the local corosync node >>>> name >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> parse_notifications: No optional alerts section in cib >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> do_started: Delaying start, no membership data >>>> (0000000000100000) >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> pcmk_cpg_membership: Node 1 joined group crmd (counter=0.0) >>>> Nov 07 16:02:01 [24347] nebel1 crmd: info: >>>> pcmk_cpg_membership: Node 1 still member of group crmd >>>> (peer=nebel1, counter=0.0) >>>> Nov 07 16:02:01 [24342] nebel1 cib: info: >>>> cib_process_request: Completed cib_modify operation for section >>>> nodes: OK (rc=0, origin=nebel1/crmd/3, version=0.8464.0) >>>> Nov 07 16:02:01 [24345] nebel1 attrd: info: >>>> attrd_cib_connect: Connected to the CIB after 2 attempts >>>> Nov 07 16:02:01 [24345] nebel1 attrd: info: main: CIB 
>>>> connection active
>>>> Nov 07 16:02:01 [24345] nebel1 attrd: info: pcmk_cpg_membership: Node 1 joined group attrd (counter=0.0)
>>>> Nov 07 16:02:01 [24345] nebel1 attrd: info: pcmk_cpg_membership: Node 1 still member of group attrd (peer=nebel1, counter=0.0)
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info: setup_cib: Watching for stonith topology changes
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info: qb_ipcs_us_publish: server name: stonith-ng
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info: main: Starting stonith-ng mainloop
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info: pcmk_cpg_membership: Node 1 joined group stonith-ng (counter=0.0)
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info: pcmk_cpg_membership: Node 1 still member of group stonith-ng (peer=nebel1, counter=0.0)
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info: init_cib_cache_cb: Updating device list from the cib: init
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info: cib_devices_update: Updating devices to version 0.8464.0
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: notice: unpack_config: On loss of CCM Quorum: Ignore
>>>> Nov 07 16:02:02 [24343] nebel1 stonith-ng: notice: stonith_device_register: Added 'stonith1Nebel2' to the device list (1 active devices)
>>>> Nov 07 16:02:02 [24343] nebel1 stonith-ng: info: cib_device_update: Device stonith1Nebel1 has been disabled on nebel1: score=-INFINITY
>>>
>>> Current cib settings:
>>>> cibadmin -Q | grep validate
>>>> <cib admin_epoch="0" epoch="8464" num_updates="0" validate-with="pacemaker-2.4" crm_feature_set="3.0.10" have-quorum="1" cib-last-written="Fri Nov 4 12:15:30 2016" update-origin="nebel3" update-client="crm_attribute" update-user="root">
>>>
>>> Any help is appreciated, thanks in advance.
>>>
>>> Regards, Toni
>>>
>>> On 03.11.2016 at 17:42, Toni Tschampke wrote:
>>>> > I'm guessing this change should be instantly written into the xml file?
>>>> > If this is the case something is wrong, grepping for validate gives the
>>>> > old string back.
>>>>
>>>> We found some strange behavior when setting "validate-with" via
>>>> cibadmin: corosync.log shows the successful transaction, and issuing
>>>> cibadmin --query gives the correct value, but it is NOT written into
>>>> cib.xml.
>>>>
>>>> We restarted pacemaker and the value is reset to pacemaker-1.1.
>>>> If the signatures for cib.xml are generated by pacemaker/cib, which
>>>> algorithm is used? Looks like md5 to me.
>>>>
>>>> Would it be possible to manually edit cib.xml and generate a valid
>>>> cib.xml.sig to get one step further in the debugging process?
>>>>
>>>> Regards, Toni
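On the signature question above: the .sig files in the directory listing further down are exactly 32 bytes, and the digests pacemaker logs (e.g. 5201c56641a95e5117df4184587c3e93) are 32 hex characters, which is consistent with MD5. A read-only sanity check, assuming the .sig really is just a plain MD5 hex digest of the file contents:

    md5sum /var/lib/heartbeat/crm/cib.xml
    cat /var/lib/heartbeat/crm/cib.xml.sig; echo

If the two strings match, hand-editing plus regenerating the digest would probably work, but it is safer to keep going through cibadmin (or a crm_shadow copy) so the daemon writes and signs the file itself.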
>>>> On 03.11.2016 at 16:39, Toni Tschampke wrote:
>>>>> > I'm going to guess you were using the experimental 1.1 schema as the
>>>>> > "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
>>>>> > changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
>>>>> > you get better results. Don't edit the file directly though; use the
>>>>> > cibadmin command so it signs the end result properly.
>>>>> >
>>>>> > After changing the validate-with, run:
>>>>> >
>>>>> > crm_verify -x /var/lib/pacemaker/cib/cib.xml
>>>>> >
>>>>> > and fix any errors that show up.
>>>>>
>>>>> Strange, the location of our cib.xml differs from your path; our cib is
>>>>> located in /var/lib/heartbeat/crm/
>>>>>
>>>>> Running cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>'
>>>>> gave no output but was logged to corosync:
>>>>>
>>>>> cib: info: cib_perform_op: -- <cib num_updates="0" validate-with="pacemaker-1.1"/>
>>>>> cib: info: cib_perform_op: ++ <cib admin_epoch="0" epoch="8462" num_updates="1" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" have-quorum="1" cib-last-written="Thu Nov 3 10:05:52 2016" update-origin="nebel1" update-client="cibadmin" update-user="root"/>
>>>>>
>>>>> I'm guessing this change should be instantly written into the xml file?
>>>>> If this is the case something is wrong: grepping for validate gives the
>>>>> old string back.
>>>>>
>>>>> <cib admin_epoch="0" epoch="8462" num_updates="0" validate-with="pacemaker-1.1" crm_feature_set="3.0.6" have-quorum="1" cib-last-written="Thu Nov 3 16:19:51 2016" update-origin="nebel1" update-client="cibadmin" update-user="root">
>>>>>
>>>>> pacemakerd --features
>>>>> Pacemaker 1.1.15 (Build: e174ec8)
>>>>>  Supporting v3.0.10:
>>>>>
>>>>> Should the crm_feature_set be updated this way too? I'm guessing this is
>>>>> done when "cibadmin --upgrade" succeeds?
>>>>>
>>>>> We just get a timeout error when trying to upgrade it with cibadmin:
>>>>> Call cib_upgrade failed (-62): Timer expired
>>>>>
>>>>> Have permissions changed from 1.1.7 to 1.1.15? When looking at our quite
>>>>> big /var/lib/heartbeat/crm/ folder, some permissions changed:
>>>>>
>>>>> -rw------- 1 hacluster root     80K Nov  1 16:56 cib-31.raw
>>>>> -rw-r--r-- 1 hacluster root      32 Nov  1 16:56 cib-31.raw.sig
>>>>> -rw------- 1 hacluster haclient 80K Nov  1 18:53 cib-32.raw
>>>>> -rw------- 1 hacluster haclient  32 Nov  1 18:53 cib-32.raw.sig
>>>>>
>>>>> cib-31 was written before the upgrade, cib-32 after starting the upgraded
>>>>> pacemaker.
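On the ownership change visible above: the log shows the cib daemon running as uid 108 / gid 114 (hacluster/haclient on this system), and the newer packages write new files with group haclient and mode 0600, so the older files still group-owned by root are just left-over history. If anything under that directory ended up unreadable for hacluster during the upgrade, normalising it should be harmless (assuming the Debian jessie defaults of user hacluster, group haclient):

    chown -R hacluster:haclient /var/lib/heartbeat/crm

Files pacemaker writes from now on will get the new ownership and mode by themselves.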
>>>>> On 03.11.2016 at 15:39, Ken Gaillot wrote:
>>>>>> On 11/03/2016 05:51 AM, Toni Tschampke wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
>>>>>>> (pacemaker 1.1.15, corosync 2.3.6). During the upgrade pacemaker was
>>>>>>> removed (rc) and afterwards reinstalled from jessie-backports, same for
>>>>>>> crmsh.
>>>>>>>
>>>>>>> Now we are encountering multiple problems:
>>>>>>>
>>>>>>> First I checked the configuration on a single node running pacemaker &
>>>>>>> corosync, which dropped a strange error, followed by multiple lines
>>>>>>> stating the syntax is wrong. crm configure show then showed a mixed view
>>>>>>> of xml and crmsh single-line syntax.
>>>>>>>
>>>>>>>> ERROR: Cannot read schema file '/usr/share/pacemaker/pacemaker-1.1.rng':
>>>>>>>> [Errno 2] No such file or directory: '/usr/share/pacemaker/pacemaker-1.1.rng'
>>>>>>
>>>>>> pacemaker-1.1.rng was renamed to pacemaker-next.rng in Pacemaker 1.1.12,
>>>>>> as it was used to hold experimental new features rather than as the
>>>>>> actual next version of the schema. So, the schema skipped to 1.2.
>>>>>>
>>>>>> I'm going to guess you were using the experimental 1.1 schema as the
>>>>>> "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
>>>>>> changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
>>>>>> you get better results. Don't edit the file directly though; use the
>>>>>> cibadmin command so it signs the end result properly.
>>>>>>
>>>>>> After changing the validate-with, run:
>>>>>>
>>>>>> crm_verify -x /var/lib/pacemaker/cib/cib.xml
>>>>>>
>>>>>> and fix any errors that show up.
>>>>>>
>>>>>>> When we looked into that folder there were pacemaker-1.0.rng, 1.2 and so
>>>>>>> on. As a quick try we symlinked 1.2 to 1.1 and the syntax errors were
>>>>>>> gone.
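Rather than symlinking the schema files, it should be enough to point validate-with at a schema that actually ships with 1.1.15 and re-validate. A rough sequence, using the paths and schema version that appear elsewhere in this thread (adjust as needed):

    ls /usr/share/pacemaker/pacemaker-*.rng          # schemas installed with 1.1.15
    cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>'
    crm_verify -x /var/lib/heartbeat/crm/cib.xml -V  # fix whatever it reports
    cibadmin --upgrade --force                       # optionally bump to the newest schema afterwards

The last step is the same cib_upgrade call discussed above, which timed out here presumably for the same underlying reason as the other local commands, so it only makes sense once the crmd/quorum problem is sorted out.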
>>>>>>> When running crm resource show, all resources showed up; when running
>>>>>>> crm_mon -1fA the output was unexpected, as it showed all nodes offline,
>>>>>>> with no DC elected:
>>>>>>>
>>>>>>>> Stack: corosync
>>>>>>>> Current DC: NONE
>>>>>>>> Last updated: Thu Nov  3 11:11:16 2016
>>>>>>>> Last change: Thu Nov  3 09:54:52 2016 by root via cibadmin on nebel1
>>>>>>>>
>>>>>>>> *** Resource management is DISABLED ***
>>>>>>>> The cluster will not attempt to start, stop or recover services
>>>>>>>>
>>>>>>>> 3 nodes and 73 resources configured:
>>>>>>>> 5 resources DISABLED and 0 BLOCKED from being started due to failures
>>>>>>>>
>>>>>>>> OFFLINE: [ nebel1 nebel2 nebel3 ]
>>>>>>>
>>>>>>> We tried to manually change dc-version.
>>>>>>>
>>>>>>> When issuing a simple cleanup command I got the following error:
>>>>>>>
>>>>>>>> crm resource cleanup DrbdBackuppcMs
>>>>>>>> Error signing on to the CRMd service
>>>>>>>> Error performing operation: Transport endpoint is not connected
>>>>>>>
>>>>>>> which looks like crmsh is not able to communicate with crmd; nothing is
>>>>>>> logged in this case in corosync.log.
>>>>>>>
>>>>>>> We experimented with multiple config changes (corosync.conf: pacemaker
>>>>>>> ver 0 > 1; cib-bootstrap-options: cluster-infrastructure from openais to
>>>>>>> corosync).
>>>>>>>
>>>>>>>> Package versions:
>>>>>>>> cman 3.1.8-1.2+b1
>>>>>>>> corosync 2.3.6-3~bpo8+1
>>>>>>>> crmsh 2.2.0-1~bpo8+1
>>>>>>>> csync2 1.34-2.3+b1
>>>>>>>> dlm-pcmk 3.0.12-3.2+deb7u2
>>>>>>>> libcman3 3.1.8-1.2+b1
>>>>>>>> libcorosync-common4:amd64 2.3.6-3~bpo8+1
>>>>>>>> munin-libvirt-plugins 0.0.6-1
>>>>>>>> pacemaker 1.1.15-2~bpo8+1
>>>>>>>> pacemaker-cli-utils 1.1.15-2~bpo8+1
>>>>>>>> pacemaker-common 1.1.15-2~bpo8+1
>>>>>>>> pacemaker-resource-agents 1.1.15-2~bpo8+1
>>>>>>>
>>>>>>>> Kernel: #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
>>>>>>>
>>>>>>> I attached our cib before and after the upgrade, as well as the one with
>>>>>>> the mixed syntax, and our corosync.conf.
>>>>>>>
>>>>>>> When we tried to connect a second node to the cluster, pacemaker starts
>>>>>>> its daemons, starts corosync and dies after 15 tries with the following
>>>>>>> in the corosync log:
>>>>>>>
>>>>>>>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>>>> crmd: info: do_cib_control: Could not connect to the CIB service: Transport endpoint is not connected
>>>>>>>> crmd: warning: do_cib_control: Couldn't complete CIB registration 15 times... pause and retry
>>>>>>>> attrd: error: attrd_cib_connect: Signon to CIB failed: Transport endpoint is not connected (-107)
>>>>>>>> attrd: info: main: Shutting down attribute manager
>>>>>>>> attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>>>>> attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>>>>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>>>> pacemakerd: warning: pcmk_child_exit: The attrd process (12761) can no longer be respawned, shutting the cluster down.
>>>>>>>> pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker
>>>>>>>
>>>>>>> A third node joins without the above error, but crm_mon still shows all
>>>>>>> nodes as offline.
>>>>>>>
>>>>>>> Thanks for any advice on how to solve this, I'm out of ideas now.
>>>>>>>
>>>>>>> Regards, Toni
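A side note on the "corosync.conf: pacemaker ver 0 > 1" experiment mentioned above: with corosync 2.x the pacemaker plugin is gone entirely, so the old service block should be removed rather than having its ver bumped; pacemaker now runs as its own service alongside corosync. Roughly, if the pre-upgrade corosync.conf (or a file under /etc/corosync/service.d/) still carries something like the following, it can simply be deleted:

    # corosync 1.x only; remove this whole block for corosync 2.x:
    service {
        name: pacemaker
        ver:  1
    }

What corosync 2.x needs instead is the quorum section shown near the top of this mail.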
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org