Hi,

Thanks for the reply. I scrapped my cluster, created it again, and then migrated as before. This time I uninstalled pacemaker, corosync, crmsh and resource agents with
make uninstall

then I installed the new packages. The problem is the same; when I launch:
corosync-quorumtool -ps

I got: Cannot initialize QUORUM service

Here is the log with debug enabled:

Attachment: corosync.log


[18019] pg3 corosyncerror   [QB    ] couldn't create circular mmap on /dev/shm/qb-cfg-event-18020-18028-23-data
[18019] pg3 corosyncerror   [QB    ] qb_rb_open:cfg-event-18020-18028-23: Resource temporarily unavailable (11)
[18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-18020-18028-23-header
[18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-18020-18028-23-header
[18019] pg3 corosyncerror   [QB    ] shm connection FAILED: Resource temporarily unavailable (11)
[18019] pg3 corosyncerror   [QB    ] Error in connection setup (18020-18028-23): Resource temporarily unavailable (11)

I tried to check /dev/shm (I am not sure these are the right commands):

df -h /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
shm              64M   16M   49M  24% /dev/shm

ls /dev/shm
qb-cmap-request-18020-18036-25-data    qb-corosync-blackbox-data    qb-quorum-request-18020-18095-32-data
qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  qb-quorum-request-18020-18095-32-header

Is 64 MB enough for /dev/shm? If not, why did it work with the previous corosync release?
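
If 64 MB is too small, I guess I could enlarge /dev/shm with something like the following (128M is just an example size, and this assumes it is an ordinary tmpfs mount):

mount -o remount,size=128M /dev/shm

and add a matching size= option to its /etc/fstab entry to make the change persistent.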


On 25 Jun 2018, at 09:09, Christine Caulfield <ccaul...@redhat.com> wrote:

On 22/06/18 11:23, Salvatore D'angelo wrote:
Hi,
Here is the log:



[17323] pg1 corosyncerror   [QB    ] couldn't create circular mmap on /dev/shm/qb-cfg-event-17324-17334-23-data
[17323] pg1 corosyncerror   [QB    ] qb_rb_open:cfg-event-17324-17334-23: Resource temporarily unavailable (11)
[17323] pg1 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-17324-17334-23-header
[17323] pg1 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-17324-17334-23-header
[17323] pg1 corosyncerror   [QB    ] shm connection FAILED: Resource temporarily unavailable (11)
[17323] pg1 corosyncerror   [QB    ] Error in connection setup (17324-17334-23): Resource temporarily unavailable (11)
[17323] pg1 corosyncdebug   [QB    ] qb_ipcs_disconnect(17324-17334-23) state:0



is /dev/shm full?


Chrissie




On 22 Jun 2018, at 12:10, Christine Caulfield <ccaul...@redhat.com> wrote:

On 22/06/18 10:39, Salvatore D'angelo wrote:
Hi,

Can you tell me exactly which log you need? I’ll provide it as soon as possible.

Regarding some settings, I am not the original author of this cluster. The people who created it left the company I am working with, I inherited the code, and sometimes I do not know why some settings are used.
The old versions of pacemaker, corosync, crmsh and resource agents were compiled and installed.
I simply downloaded the new versions, compiled and installed them. I didn’t get any complaint during ./configure, which usually checks for library compatibility.

To be honest I do not know if this is the right approach. Should I “make uninstall” the old versions before installing the new ones?
Which is the suggested approach?
Thanks in advance for your help.


OK fair enough!

To be honest the best approach is almost always to get the latest
packages from the distributor rather than compile from source. That way
you can be more sure that upgrades will go more smoothly. Though, to be
honest, I'm not sure how good the Ubuntu packages are (they might be
great, they might not, I genuinely don't know).

When building from source, if you don't know the provenance of the
previous version then I would recommend a 'make uninstall' first - or
removal of the packages if that's where they came from.
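
For example, roughly (the paths here are just illustrative, assuming the old source trees are still around):

cd /path/to/old/corosync-source && make uninstall
cd /path/to/old/pacemaker-source && make uninstall
# then ./configure, make and make install the new versions as before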

One thing you should do is make sure that all the cluster nodes are
running the same version. If some are running older versions then nodes
could drop out for obscure reasons. We try and keep minor versions
on-wire compatible but it's always best to be cautious.
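
A quick way to compare is to run the version commands on each node and check that they all report the same numbers, e.g.:

corosync -v
pacemakerd --version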

The tidying of your corosync.conf can wait for the moment, let's get
things mostly working first. If you enable debug logging in corosync.conf:

logging {
      to_syslog: yes
      debug: on
}
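
then restart corosync so it picks up the change, for example:

service corosync restart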

Then see what happens and post the syslog file that has all of the
corosync messages in it, and we'll take it from there.

Chrissie

On 22 Jun 2018, at 11:30, Christine Caulfield <ccaul...@redhat.com> wrote:

On 22/06/18 10:14, Salvatore D'angelo wrote:
Hi Christine,

Thanks for the reply. Let me add a few details. When I run the corosync
service I see the corosync process running. If I stop it and run:

corosync -f 

I see three warnings:
warning [MAIN  ] interface section bindnetaddr is used together with
nodelist. Nodelist one is going to be used.
warning [MAIN  ] Please migrate config file to nodelist.
warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
permitted (1)
warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)

but I see the node joined.


Those certainly need fixing but are probably not the cause. Also why do
you have these values below set?

max_network_delay: 100
retransmits_before_loss_const: 25
window_size: 150

I'm not saying they are causing the trouble, but they aren't going to
help keep a stable cluster.
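
As for the bindnetaddr warnings: since you already have a nodelist, the bindnetaddr lines can simply be dropped from the interface sections, roughly like this (just a sketch, keeping only the port/TTL settings):

      interface {
              ringnumber: 0
              mcastport: 5405
              ttl: 1
      }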

Without more logs (full logs are always better than just the bits you
think are meaningful) I still can't be sure. It could easily be just
that you've overwritten a packaged version of corosync with your own
compiled one and they have different configure options, or that the
libraries now don't match.
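
One quick sanity check, for example, is to see which libqb the binary is actually picking up:

ldd /usr/sbin/corosync | grep libqb   # adjust the path to wherever corosync was installed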

Chrissie


My corosync.conf file is below.

With service corosync up and running I have the following output:
corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id= 10.0.0.11
status= ring 0 active with no faults
RING ID 1
id= 192.168.0.11
status= ring 1 active with no faults

corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1) ip(192.168.0.11)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12) r(1) ip(192.168.0.12)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined

For the moment I have two nodes in my cluster (the third node has some
issues and at the moment I did crm node standby on it).

Here are the dependencies I have installed for corosync (they work fine with
pacemaker 1.1.14 and corosync 2.3.5):
   libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
   libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
   libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
   libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
   libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
   libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
   libqb0_0.16.0.real-1ubuntu4_amd64.deb

corosync.conf
---------------------
quorum {
      provider: corosync_votequorum
      expected_votes: 3
}
totem {
      version: 2
      crypto_cipher: none
      crypto_hash: none
      rrp_mode: passive
      interface {
              ringnumber: 0
              bindnetaddr: 10.0.0.0
              mcastport: 5405
              ttl: 1
      }
      interface {
              ringnumber: 1
              bindnetaddr: 192.168.0.0
              mcastport: 5405
              ttl: 1
      }
      transport: udpu
      max_network_delay: 100
      retransmits_before_loss_const: 25
      window_size: 150
}
nodelist {
      node {
              ring0_addr: pg1
              ring1_addr: pg1p
              nodeid: 1
      }
      node {
              ring0_addr: pg2
              ring1_addr: pg2p
              nodeid: 2
      }
      node {
              ring0_addr: pg3
              ring1_addr: pg3p
              nodeid: 3
      }
}
logging {
      to_syslog: yes
}




On 22 Jun 2018, at 09:24, Christine Caulfield <ccaul...@redhat.com> wrote:

On 21/06/18 16:16, Salvatore D'angelo wrote:
Hi,

I upgraded my PostgreSQL/Pacemaker cluster to these versions:
Pacemaker 1.1.14 -> 1.1.18
Corosync 2.3.5 -> 2.4.4
Crmsh 2.2.0 -> 3.0.1
Resource agents 3.9.7 -> 4.1.1

I started on the first node (I am trying a one-node-at-a-time upgrade).
On a PostgreSQL slave node I did:

crm node standby <node>
service pacemaker stop
service corosync stop

Then I built the tools above as described on their GitHub pages.

./autogen.sh (where required)
./configure
make (where required)
make install

Everything went OK. I expected the new files to overwrite the old ones. I left the
dependencies I had with the old software because I noticed ./configure
didn’t complain.
I started corosync.

service corosync start

To verify corosync works properly I used the following commands:
corosync-cfgtool -s
corosync-cmapctl | grep members

Everything seemed OK and I verified my node joined the cluster (at least
that is my impression).

Here I hit a problem. Running the command:
corosync-quorumtool -ps

I got the following problem:
Cannot initialise CFG service

That says that corosync is not running. Have a look in the log files to
see why it stopped. The pacemaker logs below are showing the same thing,
but we can't make any more guesses until we see what corosync itself is
doing. Enabling debug in corosync.conf will also help if more detail is
needed.

Also starting corosync with 'corosync -pf' on the command-line is often
a quick way of checking things are starting OK.

Chrissie


If I try to start pacemaker, I only see the pacemakerd process running, and
pacemaker.log contains the following lines:

Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: crm_log_init:Changed active directory to /var/lib/pacemaker/cores
Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: get_cluster_type:Detected an active 'corosync' cluster
Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: mcp_read_config:Reading configure for stack: corosync
Jun 21 15:09:38 [17115] pg1 pacemakerd:   notice: main:Starting Pacemaker 1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc lha-fencing nagios  corosync-native atomic-attrd acls
Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: main:Maximum core file size is: 18446744073709551615
Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: qb_ipcs_us_publish:server name: pacemakerd
Jun 21 15:09:53 [17115] pg1 pacemakerd:  warning: corosync_node_name:Could not connect to Cluster Configuration Database API, error CS_ERR_TRY_AGAIN
Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: corosync_node_name:Unable to get node name for nodeid 1
Jun 21 15:09:53 [17115] pg1 pacemakerd:   notice: get_node_name:Could not obtain a node name for corosync nodeid 1
Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Created entry 1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node (null)/1 (1 total)
Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Node 1 has uuid 1
Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_update_peer_proc:cluster_connect_cpg: Node (null)[1] - corosync-cpg is now online
Jun 21 15:09:53 [17115] pg1 pacemakerd:    error: cluster_connect_quorum:Could not connect to the Quorum API: 2
Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: qb_ipcs_us_withdraw:withdrawing server sockets
Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: main:Exiting pacemakerd
Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_xml_cleanup:Cleaning up memory from libxml2

What is wrong with my procedure?




_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
