Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie
On 11/10/2016 09:47 AM, Toni Tschampke wrote:
>> Did your upgrade documentation describe how to update the corosync
>> configuration, and did that go well? crmd may be unable to function due
>> to lack of quorum information.
>
> Thanks for this tip, corosync quorum configuration was the cause.
>
> As we changed validate-with as well as the feature set manually in the
> cib, is there a need for issuing the cibadmin --upgrade --force
> command, or is this command just for changing the schemas?

Guess not, as that would just do automatically (to the latest version,
then) what you've done manually already.
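For anyone hitting the same thing after this upgrade path: with corosync 2
the quorum provider has to be configured in corosync.conf, since the old
pacemaker quorum plugin is gone. A minimal sketch for a three-node cluster
like this one (the expected_votes value is an assumption based on the node
count in this thread):

    quorum {
        provider: corosync_votequorum
        expected_votes: 3
    }

With a nodelist section present, expected_votes can be omitted; votequorum
then derives it from the node count automatically.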
Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie
> Did your upgrade documentation describe how to update the corosync
> configuration, and did that go well? crmd may be unable to function due
> to lack of quorum information.

Thanks for this tip, corosync quorum configuration was the cause.

As we changed validate-with as well as the feature set manually in the
cib, is there a need for issuing the cibadmin --upgrade --force command,
or is this command just for changing the schemas?

--
Toni Tschampke | t...@halle.it
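To make the distinction concrete: cibadmin --upgrade rewrites the
configuration against the newest schema the installed Pacemaker ships, so
if validate-with and the feature set are already current it should be a
no-op. A rough sketch of the usual sequence (assuming the cib daemon is
reachable):

    # show which schema the live CIB currently validates against
    cibadmin --query | head -n 1

    # rewrite the CIB against the newest supported schema
    cibadmin --upgrade --force

    # sanity-check the result
    crm_verify --live-check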
Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie
On 11/07/2016 09:08 AM, Toni Tschampke wrote:
> We managed to change the validate-with option via a workaround (cibadmin
> export & replace), as setting the value with cibadmin --modify doesn't
> write the changes to disk.
>
> After experimenting with various schemas (the XML is correctly
> interpreted by crmsh) we are still not able to communicate with the
> local crmd.
>
> Can someone please help to determine why the local crmd is not
> responding (we disabled our other nodes to eliminate possible corosync
> related issues) and runs into errors/timeouts when issuing crmsh or
> cibadmin related commands.

It occurs to me that wheezy used corosync 1. There were major changes
from corosync 1 to 2 ... 1 relied on a "plugin" to provide quorum for
pacemaker, whereas 2 has quorum built-in.

Did your upgrade documentation describe how to update the corosync
configuration, and did that go well? crmd may be unable to function due
to lack of quorum information.
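For context, the wheezy-era corosync 1 setup loaded pacemaker through a
service stanza in corosync.conf, roughly like this sketch (ver: 0 meant
the plugin spawned the pacemaker daemons itself, ver: 1 meant pacemakerd
was started separately):

    service {
        name: pacemaker
        ver: 0
    }

Under corosync 2 this stanza must be removed entirely; pacemaker talks to
corosync natively, and quorum comes from the votequorum provider instead.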
Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie
We managed to change the validate-with option via a workaround (cibadmin
export & replace), as setting the value with cibadmin --modify doesn't
write the changes to disk.

After experimenting with various schemas (the XML is correctly interpreted
by crmsh) we are still not able to communicate with the local crmd.

Can someone please help to determine why the local crmd is not responding
(we disabled our other nodes to eliminate possible corosync related
issues) and runs into errors/timeouts when issuing crmsh or cibadmin
related commands.

Examples of local commands that do not work:

Timeout when running cibadmin (strace attached):
> cibadmin --upgrade --force
> Call cib_upgrade failed (-62): Timer expired

Error when running a crm resource cleanup:
> crm resource cleanup $vm
> Error signing on to the CRMd service
> Error performing operation: Transport endpoint is not connected

I attached the strace log from running cib_upgrade; does this help to
find the cause of the timeout issue?

Here is the corosync dump when locally starting pacemaker:

> Nov 07 16:01:59 [24339] nebel1 corosync notice [MAIN ] main.c:1256 Corosync Cluster Engine ('2.3.6'): started and ready to provide service.
> Nov 07 16:01:59 [24339] nebel1 corosync info [MAIN ] main.c:1257 Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices snmp pie relro bindnow
> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] totemnet.c:248 Initializing transport (UDP/IP Multicast).
> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto: none hash: none
> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] totemnet.c:248 Initializing transport (UDP/IP Multicast).
> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto: none hash: none
> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] totemudp.c:671 The network interface [10.112.0.1] is now up.
> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 Service engine loaded: corosync configuration map access [0]
> Nov 07 16:01:59 [24339] nebel1 corosync info [QB] ipc_setup.c:536 server name: cmap
> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 Service engine loaded: corosync configuration service [1]
> Nov 07 16:01:59 [24339] nebel1 corosync info [QB] ipc_setup.c:536 server name: cfg
> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Nov 07 16:01:59 [24339] nebel1 corosync info [QB] ipc_setup.c:536 server name: cpg
> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 Service engine loaded: corosync profile loading service [4]
> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 Service engine loaded: corosync resource monitoring service [6]
> Nov 07 16:01:59 [24339] nebel1 corosync info [WD] wd.c:669 Watchdog /dev/watchdog is now been tickled by corosync.
> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:625 Could not change the Watchdog timeout from 10 to 6 seconds
> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:464 resource load_15min missing a recovery key.
> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:464 resource memory_used missing a recovery key.
> Nov 07 16:01:59 [24339] nebel1 corosync info [WD] wd.c:581 no resources configured.
> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 Service engine loaded: corosync watchdog service [7]
> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174 Service engine loaded: corosync cluster quorum service v0.1 [3]
> Nov 07 16:01:59 [24339] nebel1 corosync info [QB] ipc_setup.c:536 server name: quorum
> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] totemudp.c:671 The network interface [10.110.1.1] is now up.
> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ] totemsrp.c:2095 A new membership (10.112.0.1:348) was formed. Members joined: 1
> Nov 07 16:01:59 [24339] nebel1 corosync notice [MAIN ] main.c:310 Completed service synchronization, ready to provide service.
> Nov 07 16:01:59 [24341] nebel1 pacemakerd: notice: main: Starting Pacemaker 1.1.15 | build=e174ec8 features: generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc lha-fencing upstart systemd nagios corosync-native atomic-attrd snmp libesmtp acls
> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: main: Maximum core file size is: 18446744073709551615
> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: qb_ipcs_us_publish: server name: pacemakerd
> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: corosync_node_name: Unable to get node name for nodeid 1
> Nov 07 16:01:59 [24341] nebel1 pacemakerd: notice: get_node_name:
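A few checks that help narrow down this kind of "Transport endpoint is
not connected" state (a sketch; output will differ per setup):

    # is corosync quorate? (corosync 2)
    corosync-quorumtool -s

    # are all pacemaker daemons actually running?
    ps axf | grep -E 'pacemakerd|crmd|cib'

    # does the cib daemon answer IPC at all?
    cibadmin --query > /dev/null && echo "cib answers"

If corosync-quorumtool reports no quorum or cannot connect, the problem
is below pacemaker, which matches the quorum configuration issue found
later in this thread.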
Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie
> I'm guessing this change should be instantly written into the xml file?
> If this is the case something is wrong, grepping for validate gives the
> old string back.

We found some strange behavior when setting "validate-with" via cibadmin:
corosync.log shows the successful transaction, and issuing cibadmin
--query gives the correct value, but it is NOT written into cib.xml. We
restarted pacemaker and the value is reset to pacemaker-1.1.

If the signatures for the cib.xml are generated by pacemaker/cib, which
algorithm is used? Looks like MD5 to me. Would it be possible to manually
edit the cib.xml and generate a valid cib.xml.sig, to get one step
further in the debugging process?

Regards, Toni

--
Toni Tschampke | t...@halle.it
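If the digest is indeed a plain MD5 of the file contents (the 32-byte
.sig files in the directory listing elsewhere in this thread point that
way, but this is an assumption), a quick consistency check would be:

    cd /var/lib/heartbeat/crm
    md5sum cib.xml          # hex digest of the file contents
    cat cib.xml.sig; echo   # stored digest, for comparison

Whether a hand-edited cib.xml plus a matching regenerated .sig survives a
daemon restart is untested here; treat this purely as a debugging aid,
not a supported procedure.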
Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie
> I'm going to guess you were using the experimental 1.1 schema as the
> "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
> changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
> you get better results. Don't edit the file directly though; use the
> cibadmin command so it signs the end result properly.
>
> After changing the validate-with, run:
>
>    crm_verify -x /var/lib/pacemaker/cib/cib.xml
>
> and fix any errors that show up.

Strange: the location of our cib.xml differs from your path, our cib is
located in /var/lib/heartbeat/crm/.

Running

> cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>'

gave no output but was logged to corosync:

> cib: info: cib_perform_op: -- <cib validate-with="pacemaker-1.1"/>
> cib: info: cib_perform_op: ++ <cib num_updates="1" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" have-quorum="1" cib-last-written="Thu Nov 3 10:05:52 2016" update-origin="nebel1" update-client="cibadmin" update-user="root"/>

I'm guessing this change should be instantly written into the xml file?
If this is the case something is wrong, grepping for validate gives the
old string back:

> <cib validate-with="pacemaker-1.1" crm_feature_set="3.0.6" have-quorum="1" cib-last-written="Thu Nov 3 16:19:51 2016" update-origin="nebel1" update-client="cibadmin" update-user="root">

> pacemakerd --features
> Pacemaker 1.1.15 (Build: e174ec8)
> Supporting v3.0.10:

Should the crm_feature_set be updated this way too? I'm guessing this is
done when "cibadmin --upgrade" succeeds? We just get a timeout error when
trying to upgrade it with cibadmin:

> Call cib_upgrade failed (-62): Timer expired

Have permissions changed from 1.1.7 to 1.1.15? When looking at our quite
big /var/lib/heartbeat/crm/ folder, some permissions changed:

> -rw------- 1 hacluster root     80K Nov  1 16:56 cib-31.raw
> -rw-r--r-- 1 hacluster root      32 Nov  1 16:56 cib-31.raw.sig
> -rw------- 1 hacluster haclient 80K Nov  1 18:53 cib-32.raw
> -rw------- 1 hacluster haclient  32 Nov  1 18:53 cib-32.raw.sig

cib-31 is from before the upgrade, cib-32 from after starting the
upgraded pacemaker.

--
Toni Tschampke | t...@halle.it
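Judging by the cib-32 files the upgraded pacemaker writes, the
expectation is hacluster:haclient ownership with mode 600. A sketch for
aligning the older files (assuming the Debian user/group names hacluster
and haclient):

    chown hacluster:haclient /var/lib/heartbeat/crm/cib*
    chmod 600 /var/lib/heartbeat/crm/cib*.raw /var/lib/heartbeat/crm/cib*.raw.sig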
Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie
On 11/03/2016 05:51 AM, Toni Tschampke wrote:
> Hi,
>
> we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
> (pacemaker 1.1.15, corosync 2.3.6). During the upgrade pacemaker was
> removed (rc) and reinstalled afterwards from jessie-backports, same for
> crmsh.
>
> Now we are encountering multiple problems:
>
> First I checked the configuration on a single node running pacemaker &
> corosync, which dropped a strange error, followed by multiple lines
> stating the syntax is wrong. crm configure show then showed a mixed view
> of xml and crmsh single-line syntax.
>
>> ERROR: Cannot read schema file '/usr/share/pacemaker/pacemaker-1.1.rng':
>> [Errno 2] No such file or directory:
>> '/usr/share/pacemaker/pacemaker-1.1.rng'

pacemaker-1.1.rng was renamed to pacemaker-next.rng in Pacemaker 1.1.12,
as it was used to hold experimental new features rather than as the
actual next version of the schema. So, the schema skipped to 1.2.

I'm going to guess you were using the experimental 1.1 schema as the
"validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
you get better results. Don't edit the file directly though; use the
cibadmin command so it signs the end result properly.

After changing the validate-with, run:

   crm_verify -x /var/lib/pacemaker/cib/cib.xml

and fix any errors that show up.
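Spelled out, that change is a one-liner; a sketch (the CIB directory is
/var/lib/pacemaker/cib on stock jessie installs, but may still be
/var/lib/heartbeat/crm on systems upgraded from wheezy, as seen later in
this thread):

    cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>'
    crm_verify -x /var/lib/heartbeat/crm/cib.xml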
Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie
> You'll want to switch your validate-with schema to a newer schema, and
> most likely there will be one or two things that don't validate
> anymore. There is the "crm configure upgrade" command, but if crmsh is
> having problems you can call cibadmin directly:
>
> cibadmin --upgrade --force

When trying to run this command, I just get a timeout:

> Call cib_upgrade failed (-62): Timer expired

corosync.log shows the attempt:

> cib: info: cib_process_request: Forwarding cib_upgrade operation for section 'all' to all (origin=local/cibadmin/2)

I tried to get the current value; either the command is wrong or there is
no value set for validate-with:

> crm_attribute --type crm_config --query --name validate-with
> scope=crm_config name=validate-with value=(null)
> Error performing operation: No such device or address

I would think increasing the timeout won't fix this; how do I find out
which timeout is involved and why it is triggered?

I attached the strace dump, hope this helps to figure out where the
problem sits. Is there another way to set the correct validate-with
option if both options do not work?

Regards, Toni

--
Toni Tschampke | t...@halle.it

[attachment: strace of cibadmin --upgrade --force]
execve("/usr/sbin/cibadmin", ["cibadmin", "--upgrade", "--force"], [/* 42 vars */]) = 0
brk(0) = 0x7fee60f7a000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fee60845000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=60973, ...}) = 0
mmap(NULL, 60973, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fee60836000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/libcrmcommon.so.3", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p\1\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=367264, ...}) = 0
mmap(NULL, 2467216, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fee601c4000
mprotect(0x7fee6021a000, 2093056, PROT_NONE) = 0
mmap(0x7fee60419000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x55000) = 0x7fee60419000
mmap(0x7fee6041e000, 1424, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fee6041e000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/libcib.so.4", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0Ps\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=128320, ...}) = 0
mmap(NULL, 2224936, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fee5ffa4000
mprotect(0x7fee5ffc1000, 2097152, PROT_NONE) = 0
mmap(0x7fee601c1000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1d000) = 0x7fee601c1000
mmap(0x7fee601c3000, 808, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fee601c3000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/libqb.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>
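One detail worth knowing here: validate-with is an attribute on the <cib>
element itself, not a cluster option inside crm_config, which is why the
crm_attribute query above returns value=(null). A way to read it (a
sketch, assuming the cib daemon responds):

    cibadmin --query | grep -m1 -o 'validate-with="[^"]*"'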
Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie
Toni Tschampke writes:

> Hi,
>
> we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
> (pacemaker 1.1.15, corosync 2.3.6). During the upgrade pacemaker was
> removed (rc) and reinstalled afterwards from jessie-backports, same for
> crmsh.

You'll want to switch your validate-with schema to a newer schema, and
most likely there will be one or two things that don't validate anymore.
There is the "crm configure upgrade" command, but if crmsh is having
problems you can call cibadmin directly:

    cibadmin --upgrade --force

Going from 1.1.7 to 1.1.15 is quite a big jump, so there is a lot that
could go wrong. Your configuration looks fine from a first glance; the
reason you're getting XML mixed in is the missing schema: crmsh can't be
sure that it translated the XML to line syntax correctly, so it falls
back to showing the XML. That should all fix itself by changing the
validate-with attribute on the root tag to a newer version.

I'm guessing that the errors you are getting when connecting the second
node are due to the missing schema; it's hard to tell based on the log
snippet attached, though.

--
// Kristoffer Grönlund
// kgronl...@suse.com
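For completeness, the crmsh equivalent mentioned above (a sketch; it
requires a working crmd/cib connection, which is exactly what is broken
in this thread):

    crm configure upgrade force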
[ClusterLabs] pacemaker after upgrade from wheezy to jessie
Hi,

we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
(pacemaker 1.1.15, corosync 2.3.6). During the upgrade pacemaker was
removed (rc) and reinstalled afterwards from jessie-backports, same for
crmsh.

Now we are encountering multiple problems:

First I checked the configuration on a single node running pacemaker &
corosync, which dropped a strange error, followed by multiple lines
stating the syntax is wrong. crm configure show then showed a mixed view
of xml and crmsh single-line syntax.

> ERROR: Cannot read schema file '/usr/share/pacemaker/pacemaker-1.1.rng':
> [Errno 2] No such file or directory:
> '/usr/share/pacemaker/pacemaker-1.1.rng'

When we looked into that folder there were pacemaker-1.0.rng, 1.2 and so
on. As a quick try we symlinked 1.2 to 1.1 and the syntax errors were
gone. When running crm resource show, all resources showed up; when
running crm_mon -1fA the output was unexpected, as it showed all nodes
offline, with no DC elected:

> Stack: corosync
> Current DC: NONE
> Last updated: Thu Nov 3 11:11:16 2016
> Last change: Thu Nov 3 09:54:52 2016 by root via cibadmin on nebel1
>
> *** Resource management is DISABLED ***
> The cluster will not attempt to start, stop or recover services
>
> 3 nodes and 73 resources configured:
> 5 resources DISABLED and 0 BLOCKED from being started due to failures
>
> OFFLINE: [ nebel1 nebel2 nebel3 ]

We tried to manually change dc-version.

When issuing a simple cleanup command I got the following error:

> crm resource cleanup DrbdBackuppcMs
> Error signing on to the CRMd service
> Error performing operation: Transport endpoint is not connected

which looks like crmsh is not able to communicate with crmd; nothing is
logged in this case in corosync.log.

We experimented with multiple config changes (corosync.conf: pacemaker
ver 0 > 1; cib-bootstrap-options: cluster-infrastructure from openais to
corosync).

> Package versions:
> cman 3.1.8-1.2+b1
> corosync 2.3.6-3~bpo8+1
> crmsh 2.2.0-1~bpo8+1
> csync2 1.34-2.3+b1
> dlm-pcmk 3.0.12-3.2+deb7u2
> libcman3 3.1.8-1.2+b1
> libcorosync-common4:amd64 2.3.6-3~bpo8+1
> munin-libvirt-plugins 0.0.6-1
> pacemaker 1.1.15-2~bpo8+1
> pacemaker-cli-utils 1.1.15-2~bpo8+1
> pacemaker-common 1.1.15-2~bpo8+1
> pacemaker-resource-agents 1.1.15-2~bpo8+1

> Kernel: #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux

I attached our cib before the upgrade and after, as well as the one with
the mixed syntax, and our corosync.conf.

When we tried to connect a second node to the cluster, pacemaker starts
its daemons, starts corosync, and dies after 15 tries with the following
in the corosync log:

> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
> crmd: info: do_cib_control: Could not connect to the CIB service: Transport endpoint is not connected
> crmd: warning: do_cib_control: Couldn't complete CIB registration 15 times... pause and retry
> attrd: error: attrd_cib_connect: Signon to CIB failed: Transport endpoint is not connected (-107)
> attrd: info: main: Shutting down attribute manager
> attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
> attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
> pacemakerd: warning: pcmk_child_exit: The attrd process (12761) can no longer be respawned, shutting the cluster down.
> pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker

A third node joins without the above error, but crm_mon still shows all
nodes as offline.
Thanks for any advice on how to solve this, I'm out of ideas now.

Regards, Toni

--
Toni Tschampke | t...@halle.it
bcs kommunikationslösungen
Inh. Dipl. Ing. Carsten Burkhardt
Harz 51 | 06108 Halle (Saale) | Germany
tel +49 345 29849-0 | fax +49 345 29849-22
www.b-c-s.de | www.halle.it | www.wivewa.de

[attachment: crmsh configuration]

node nebel1 \
    utilization memory=61440 \
    attributes standby=off
node nebel2 \
    utilization memory=61440 \
    attributes standby=on
node nebel3 \
    utilization memory=6144 \
    attributes standby=on
primitive ClusterEmail MailTo \
    params email="clus...@bcs.bcs" subject="[cluster]" \
    meta allow-migrate=true target-role=Stopped
primitive ClusterIp IPaddr2 \
    params ip=10.110.2.1 cidr_netmask=16 \
    op monitor interval=30
primitive ClusterMon ClusterMon \
    params extra_options="-r -f -A -o" htmlfile="/var/www/cluster-status.html" \
    operations $id=ClusterMon-operations \
    op monitor interval=60 start-delay=0 timeout=30 \
    meta target-role=started
primitive DhcpDaemon lsb:isc-dhcp-server \
    op start interval=0 timeout=30 \
    op stop interval=0 timeout=30 \
    op monitor interval=60 \
    meta target-role=Started
primitive DrbdAptcacher ocf:linbit:drbd \
    params drbd_resource=apt-cacher \
    operations $id=DrbdAptcacher-operations \
    op s
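Since the original error is about a missing RELAX NG schema file, one way
to check which shipped schema the old configuration actually satisfies,
independently of crmsh and the cluster daemons, is libxml2's command-line
validator (a sketch; run from the schema directory so the schema's
includes resolve, and adjust the CIB path to your installation):

    cd /usr/share/pacemaker
    xmllint --noout --relaxng pacemaker-1.2.rng /var/lib/heartbeat/crm/cib.xml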