On 22/06/18 10:14, Salvatore D'angelo wrote:
> Hi Christine,
>
> Thanks for the reply. Let me add a few details. When I run the corosync
> service I see the corosync process running. If I stop it and run:
>
> corosync -f
>
> I see these warnings:
>
> warning [MAIN  ] interface section bindnetaddr is used together with
> nodelist. Nodelist one is going to be used.
> warning [MAIN  ] Please migrate config file to nodelist.
> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
> permitted (1)
> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
>
> but I see the node joined.
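The first two warnings come from mixing interface bindnetaddr with a nodelist: once a nodelist is present, corosync derives the ring addresses from it, so the bindnetaddr lines can simply be dropped. A minimal sketch of the migrated totem section, using the node names from the corosync.conf quoted further down (illustrative only, not a drop-in file):

```
totem {
        version: 2
        rrp_mode: passive
        transport: udpu
        interface {
                ringnumber: 0
                mcastport: 5405
                ttl: 1
        }
        interface {
                ringnumber: 1
                mcastport: 5405
                ttl: 1
        }
}
nodelist {
        node {
                ring0_addr: pg1
                ring1_addr: pg1p
                nodeid: 1
        }
        # ... remaining nodes unchanged ...
}
```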
Those certainly need fixing but are probably not the cause.

Also, why do you have these values below set?

max_network_delay: 100
retransmits_before_loss_const: 25
window_size: 150

I'm not saying they are causing the trouble, but they aren't going to
help keep a stable cluster.

Without more logs (full logs are always better than just the bits you
think are meaningful) I still can't be sure. It could easily be that
you've overwritten a packaged version of corosync with your own compiled
one and they have different configure options, or that the libraries now
don't match.

Chrissie

> My corosync.conf file is below.
>
> With the corosync service up and running I have the following output:
>
> *corosync-cfgtool -s*
> Printing ring status.
> Local node ID 1
> RING ID 0
>         id      = 10.0.0.11
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 192.168.0.11
>         status  = ring 1 active with no faults
>
> *corosync-cmapctl | grep members*
> runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1) ip(192.168.0.11)
> runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
> runtime.totem.pg.mrp.srp.members.1.status (str) = joined
> runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12) r(1) ip(192.168.0.12)
> runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
> runtime.totem.pg.mrp.srp.members.2.status (str) = joined
>
> For the moment I have two nodes in my cluster (the third node had some
> issues, so for now I have put it in standby with crm node standby).
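The overwritten-package possibility mentioned above is easy to check, because `make install` defaults to /usr/local, and /usr/local/sbin normally precedes /usr/sbin in PATH, so a local build silently shadows the distro binary. A self-contained sketch of that mechanism using throwaway stub scripts in a temp directory (the stubs and paths are invented for the demo, not the real daemon):

```shell
#!/bin/sh
# Simulate a locally built corosync shadowing the packaged one.
tmp=$(mktemp -d)
mkdir -p "$tmp/local/sbin" "$tmp/sbin"
printf '#!/bin/sh\necho locally-built\n' > "$tmp/local/sbin/corosync"
printf '#!/bin/sh\necho packaged\n'      > "$tmp/sbin/corosync"
chmod +x "$tmp/local/sbin/corosync" "$tmp/sbin/corosync"

# With the "local" dir first in PATH (as /usr/local/sbin usually is),
# the local build wins over the packaged one without any error.
resolved=$(PATH="$tmp/local/sbin:$tmp/sbin" command -v corosync)
banner=$(PATH="$tmp/local/sbin:$tmp/sbin" corosync)
echo "$resolved"
echo "$banner"
rm -rf "$tmp"
```

On a real node, `command -v corosync`, `corosync -v`, and `ldd $(command -v corosync)` tell you which copy actually runs and whether its shared libraries resolve consistently.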
>
> Here are the dependencies I have installed for corosync (they work fine
> with pacemaker 1.1.14 and corosync 2.3.5):
>
> libnspr4-dev_2:4.10.10-0ubuntu0.14.04.1_amd64.deb
> libnspr4_2:4.10.10-0ubuntu0.14.04.1_amd64.deb
> libnss3-dev_2:3.19.2.1-0ubuntu0.14.04.2_amd64.deb
> libnss3-nssdb_2:3.19.2.1-0ubuntu0.14.04.2_all.deb
> libnss3_2:3.19.2.1-0ubuntu0.14.04.2_amd64.deb
> libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
> libqb0_0.16.0.real-1ubuntu4_amd64.deb
>
> *corosync.conf*
> ---------------------
> quorum {
>         provider: corosync_votequorum
>         expected_votes: 3
> }
> totem {
>         version: 2
>         crypto_cipher: none
>         crypto_hash: none
>         rrp_mode: passive
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 10.0.0.0
>                 mcastport: 5405
>                 ttl: 1
>         }
>         interface {
>                 ringnumber: 1
>                 bindnetaddr: 192.168.0.0
>                 mcastport: 5405
>                 ttl: 1
>         }
>         transport: udpu
>         max_network_delay: 100
>         retransmits_before_loss_const: 25
>         window_size: 150
> }
> nodelist {
>         node {
>                 ring0_addr: pg1
>                 ring1_addr: pg1p
>                 nodeid: 1
>         }
>         node {
>                 ring0_addr: pg2
>                 ring1_addr: pg2p
>                 nodeid: 2
>         }
>         node {
>                 ring0_addr: pg3
>                 ring1_addr: pg3p
>                 nodeid: 3
>         }
> }
> logging {
>         to_syslog: yes
> }
>
>> On 22 Jun 2018, at 09:24, Christine Caulfield <ccaul...@redhat.com> wrote:
>>
>> On 21/06/18 16:16, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> I upgraded my PostgreSQL/Pacemaker cluster with these versions:
>>>
>>> Pacemaker 1.1.14 -> 1.1.18
>>> Corosync 2.3.5 -> 2.4.4
>>> Crmsh 2.2.0 -> 3.0.1
>>> Resource agents 3.9.7 -> 4.1.1
>>>
>>> I started on a first node (I am trying a one-node-at-a-time upgrade).
>>> On a PostgreSQL slave node I did:
>>>
>>> *crm node standby <node>*
>>> *service pacemaker stop*
>>> *service corosync stop*
>>>
>>> Then I built the tools above as described on their GitHub pages.
>>>
>>> *./autogen.sh (where required)*
>>> *./configure*
>>> *make (where required)*
>>> *make install*
>>>
>>> Everything went OK. I expected the new files to overwrite the old
>>> ones. I kept the dependencies I had from the old software because I
>>> noticed ./configure didn't complain.
>>>
>>> I started corosync:
>>>
>>> *service corosync start*
>>>
>>> To verify corosync worked properly I used the following commands:
>>>
>>> *corosync-cfgtool -s*
>>> *corosync-cmapctl | grep members*
>>>
>>> Everything seemed OK and I verified my node joined the cluster (at
>>> least this is my impression).
>>>
>>> Here I hit a problem. Running:
>>>
>>> corosync-quorumtool -ps
>>>
>>> I got the following error:
>>>
>>> Cannot initialise CFG service
>>>
>> That says that corosync is not running. Have a look in the log files to
>> see why it stopped. The pacemaker logs below are showing the same thing,
>> but we can't make any more guesses until we see what corosync itself is
>> doing. Enabling debug in corosync.conf will also help if more detail is
>> needed.
>>
>> Also, starting corosync with 'corosync -pf' on the command line is often
>> a quick way of checking things are starting OK.
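Following the debug suggestion above, a minimal sketch of a more verbose logging stanza, extending the to_syslog setting already in the quoted corosync.conf (option names as documented in corosync.conf(5); the logfile path is just an example):

```
logging {
        to_syslog: yes
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        debug: on
        timestamp: on
}
```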
>>
>> Chrissie
>>
>>> If I try to start pacemaker, I only see the pacemakerd process running
>>> and pacemaker.log containing the following lines:
>>>
>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores/
>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: get_cluster_type: Detected an active 'corosync' cluster
>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: mcp_read_config: Reading configure for stack: corosync
>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: notice: main: Starting Pacemaker 1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc lha-fencing nagios corosync-native atomic-attrd acls
>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: main: Maximum core file size is: 18446744073709551615
>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: qb_ipcs_us_publish: server name: pacemakerd
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: warning: corosync_node_name: Could not connect to Cluster Configuration Database API, error CS_ERR_TRY_AGAIN
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: corosync_node_name: Unable to get node name for nodeid 1
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: notice: get_node_name: Could not obtain a node name for corosync nodeid 1
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_get_peer: Created entry 1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node (null)/1 (1 total)
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_get_peer: Node 1 has uuid 1
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1] - corosync-cpg is now online
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: error: cluster_connect_quorum: Could not connect to the Quorum API: 2
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: main: Exiting pacemakerd
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>
>>> *What is wrong in my procedure?*
>>>
>>> _______________________________________________
>>> Users mailing list: Users@clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org