Nikhil,
Found the root cause.
In schedwrk.c, the function handle2void() uses a union that was not initialized. Because of that, the handle value was computed incorrectly (the lower half was garbage).
static hdb_handle_t
void2handle (const void *v) { union u u = {}; u.v = v; return u.h; }
static const void *
handle2void (hdb_handle_t h) { union u u = {}; u.h = h; return u.v; }
After initializing the union (as highlighted), corosync initialization seems to go through fine. Will check other things.
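For context, here is a minimal standalone sketch of why that goes wrong (my own illustration, not the corosync code; the union layout is assumed from how u.h and u.v are used above, with a 64-bit hdb_handle_t and a 32-bit pointer on ppc32):

#include <stdint.h>
#include <stdio.h>

/* Assumed shape of the union from schedwrk.c: the two members
 * do not fully overlap on a 32-bit target. */
union u {
        uint64_t h;          /* stands in for hdb_handle_t (64-bit) */
        const void *v;       /* only 32 bits wide on ppc32 */
};

int main(void)
{
        union u u;                       /* NOT initialized: u.h starts as stack garbage */
        u.v = (const void *)0x1234;      /* on 32-bit big-endian this writes only the
                                            HIGH 32 bits of u.h, so the LOW half keeps
                                            whatever garbage was already there */
        printf("h = 0x%016llx\n", (unsigned long long)u.h);
        /* With "union u u = {};" the whole 64 bits start out zeroed and
         * the handle survives the round trip. */
        return 0;
}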
Your patch is incorrect and actually doesn't work. As I said (when pointing you to schedwrk.c), I will send you a proper patch, but fixing that issue correctly is not easy.
Regards,
Honza
-Regards
Nikhil
On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane <nikhil.subscri...@gmail.com>
wrote:
Thanks for your response, Dejan.
I do not know yet whether this has anything to do with endianness. FWIW, there could be something quirky with the system, so I'm keeping all options open. :)
I added some debug prints to understand what's happening under the hood.
*Success case (on x86 machine):*
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.206.1.7:137220) was formed. Members joined: 181272839
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0, my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1, my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC ] Nikhil: Inside sync_deliver_fn. header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2, my_high_delivered=1
[TOTEM ] Delivering 1 to 2
[TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
[SYNC ] Nikhil: Inside sync_deliver_fn. header->id=0
[SYNC ] Nikhil: Entering sync_barrier_handler
[SYNC ] Committing synchronization for corosync configuration map access.
[TOTEM ] Delivering 2 to 4
[TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
[TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
[CPG ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
[CPG ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
[SYNC ] Committing synchronization for corosync cluster closed process group service v1.01
*[MAIN ] Completed service synchronization, ready to provide service.*
*Failure case (on ppc):*
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined: 181344357
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0, my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1, my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC ] Nikhil: Inside sync_deliver_fn header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1, my_high_delivered=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1, my_high_delivered=1
The above message repeats continuously.
So it appears that in the failure case I do not receive the messages with sequence numbers 2-4.
If somebody can throw out some ideas, that would help a lot.
-Thanks
Nikhil
On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic <deja...@fastmail.fm>
wrote:
Hi,
On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:
As your hardware is probably capable of running ppcle, and if you have an environment at hand without too much effort, it might pay off to try that.
There are of course distributions out there that support corosync on big-endian architectures, but I don't know if there is an automated regression for corosync on big-endian that would catch big-endian issues right away with something as current as your 2.3.5.
No, we are not testing big-endian.
So I totally agree with Klaus: give ppcle a try. Also make sure all nodes are little-endian. Corosync should work in a mixed BE/LE environment, but because it's not tested, it may not work (and that's a bug, so if ppcle works I will try to fix BE).
I tested a cluster consisting of big-endian/little-endian nodes (s390 and x86-64), but that was a while ago. IIRC, all relevant bugs in corosync got fixed at that time. I don't know what the situation is with the latest version.
Thanks,
Dejan
Regards,
Honza
Regards,
Klaus
On 05/02/2016 06:44 AM, Nikhil Utane wrote:
Re-sending as I don't see my post on the thread.
On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane <nikhil.subscri...@gmail.com> wrote:
Hi,
Looking for some guidance here as we are completely blocked
otherwise :(.
-Regards
Nikhil
On Fri, Apr 29, 2016 at 6:11 PM, Sriram <sriram...@gmail.com> wrote:
Corrected the subject.
We went ahead and captured corosync debug logs on our ppc board. After analyzing them and comparing against the successful logs (from the x86 machine), we didn't find *"[MAIN ] Completed service synchronization, ready to provide service."* in the ppc logs.
So it looks like corosync is not in a position to accept connections from Pacemaker. I also tried with the new corosync.conf, with no success.
Any hints on this issue would be really helpful.
Attaching ppc_notworking.log, x86_working.log, corosync.conf.
Regards,
Sriram
On Fri, Apr 29, 2016 at 2:44 PM, Sriram <sriram...@gmail.com> wrote:
Hi,
I went ahead and made some changes in the file system (I brought in /etc/init.d/corosync, /etc/init.d/pacemaker, and /etc/sysconfig). After that I was able to run "pcs cluster start",
but it failed with the following error:
# pcs cluster start
Starting Cluster...
Starting Pacemaker Cluster Manager[FAILED]
Error: unable to start pacemaker
And in the /var/log/pacemaker.log, I saw these errors:
pacemakerd: info: mcp_read_config: cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 4s
Apr 29 08:53:47 [15863] node_cu pacemakerd: info: mcp_read_config: cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 5s
Apr 29 08:53:52 [15863] node_cu pacemakerd: warning: mcp_read_config: Could not connect to Cluster Configuration Database API, error 6
Apr 29 08:53:52 [15863] node_cu pacemakerd: notice: main: Could not obtain corosync config data, exiting
Apr 29 08:53:52 [15863] node_cu pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
And in the /var/log/Debuglog, I saw these errors coming from corosync:
20160429 085347.487050 airv_cu daemon.warn corosync[12857]: [QB ] Denied connection, is not ready (12857-15863-14)
20160429 085347.487067 airv_cu daemon.info corosync[12857]: [QB ] Denied connection, is not ready (12857-15863-14)
I browsed the libqb code and found that it is failing in
https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
in the handle_new_connection function (line 600), around line 637:

if (auth_result == 0 && c->service->serv_fns.connection_accept) {
        res = c->service->serv_fns.connection_accept(c, c->euid, c->egid);
}
if (res != 0) {
        goto send_response;
}
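If I read that right, the server side can veto an incoming client through its connection_accept callback. Here is a minimal sketch of that shape (my own illustration, not corosync's actual code; the handler name and the ready flag are hypothetical, and I am assuming corosync returns something like -EAGAIN while it is still synchronizing, which the client then sees as CS_ERR_TRY_AGAIN):

#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <qb/qbipcs.h>

static int server_ready;        /* hypothetical: set to 1 once the service is usable */

/* A connection_accept handler as registered in libqb's
 * qb_ipcs_service_handlers. Returning non-zero from here is what
 * makes handle_new_connection() jump to send_response and log
 * "Denied connection" as seen above. */
static int32_t
my_connection_accept(qb_ipcs_connection_t *c, uid_t uid, gid_t gid)
{
        (void)c; (void)uid; (void)gid;
        if (!server_ready) {
                return -EAGAIN;         /* tell the client to retry later */
        }
        return 0;                       /* accept the connection */
}

If that is what is happening, the denial would be consistent with corosync never logging "Completed service synchronization, ready to provide service".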
Any hints on this issue would be really helpful for me to proceed.
Please let me know if any logs are required.
Regards,
Sriram
On Thu, Apr 28, 2016 at 2:42 PM, Sriram <sriram...@gmail.com> wrote:
Thanks Ken and Emmanuel.
It's a big-endian machine. I will try running "pcs cluster setup" and "pcs cluster start".
Inside cluster.py, "service pacemaker start" and "service corosync start" are executed to bring up pacemaker and corosync. Those service scripts, and the infrastructure needed to bring up the processes that way, don't exist on my board. As it is an embedded board with limited memory, a full-fledged Linux is not installed.
Just curious to know what could be the reason pacemaker throws this error:
"cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 1s"
Thanks for the response.
Regards,
Sriram.
On Thu, Apr 28, 2016 at 8:55 AM, Ken Gaillot <kgail...@redhat.com> wrote:
On 04/27/2016 11:25 AM, emmanuel segura wrote:
> You need to use pcs to do everything: pcs cluster setup and pcs
> cluster start. Try the Red Hat docs for more information.
Agreed -- pcs cluster setup will create a proper corosync.conf for you.
Your corosync.conf below uses corosync 1 syntax, and there were significant changes in corosync 2. In particular, you don't need the file created in step 4, because pacemaker is no longer launched via a corosync plugin.
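For reference, a corosync 2 style corosync.conf needs little more than the following (a minimal sketch reusing your cluster name and node name; pcs cluster setup would generate something along these lines, with values matched to your network):

totem {
        version: 2
        cluster_name: mycluster
        transport: udpu
}

nodelist {
        node {
                ring0_addr: node_cu
                nodeid: 1
        }
}

quorum {
        provider: corosync_votequorum
}

logging {
        to_syslog: yes
}

With a nodelist and transport: udpu there is no need for the interface/bindnetaddr/mcastaddr settings, and the service.d/pcmk file can be dropped entirely.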
> 2016-04-27 17:28 GMT+02:00 Sriram <sriram...@gmail.com>:
>> Dear All,
>>
>> I'm trying to use pacemaker and corosync for the clustering requirement
>> that came up recently.
>> We have cross-compiled corosync, pacemaker and pcs (python) for the ppc
>> environment (the target board where pacemaker and corosync are supposed to run).
>> I'm having trouble bringing up pacemaker in that environment, though I
>> could successfully bring up corosync.
>> Any help is welcome.
>>
>> I'm using these versions of pacemaker and corosync:
>> [root@node_cu pacemaker]# corosync -v
>> Corosync Cluster Engine, version '2.3.5'
>> Copyright (c) 2006-2009 Red Hat, Inc.
>> [root@node_cu pacemaker]# pacemakerd -$
>> Pacemaker 1.1.14
>> Written by Andrew Beekhof
>>
>> For running corosync, I did the following.
>> 1. Created the following directories:
>> /var/lib/corosync
>> /var/lib/pacemaker
>> /var/lib/pacemaker/cores
>> /var/lib/pacemaker/pengine
>> /var/lib/pacemaker/blackbox
>> /var/lib/pacemaker/cib
>>
>>
>> 2. Created a file called corosync.conf under
/etc/corosync folder with the
>> following contents
>>
>> totem {
>>         version: 2
>>         token: 5000
>>         token_retransmits_before_loss_const: 20
>>         join: 1000
>>         consensus: 7500
>>         vsftype: none
>>         max_messages: 20
>>         secauth: off
>>         cluster_name: mycluster
>>         transport: udpu
>>         threads: 0
>>         clear_node_high_bit: yes
>>
>>         interface {
>>                 ringnumber: 0
>>                 # The following three values need to be set based on your environment
>>                 bindnetaddr: 10.x.x.x
>>                 mcastaddr: 226.94.1.1
>>                 mcastport: 5405
>>         }
>> }
>>
>> logging {
>>         fileline: off
>>         to_syslog: yes
>>         to_stderr: no
>>         logfile: /var/log/corosync.log
>>         syslog_facility: daemon
>>         debug: on
>>         timestamp: on
>> }
>>
>> amf {
>>         mode: disabled
>> }
>>
>> quorum {
>>         provider: corosync_votequorum
>> }
>>
>> nodelist {
>>         node {
>>                 ring0_addr: node_cu
>>                 nodeid: 1
>>         }
>> }
>>
>> 3. Created authkey under /etc/corosync
>>
>> 4. Created a file called pcmk under /etc/corosync/service.d with the
>> contents below:
>> cat pcmk
>> service {
>>         # Load the Pacemaker Cluster Resource Manager
>>         name: pacemaker
>>         ver: 1
>> }
>>
>> 5. Added the node name "node_cu" in /etc/hosts with the 10.X.X.X IP
>>
>> 6. ./corosync -f -p & --> this step started corosync
>>
>> [root@node_cu pacemaker]# netstat -alpn | grep -i coros
>> udp        0      0 10.X.X.X:61841        0.0.0.0:*        9133/corosync
>> udp        0      0 10.X.X.X:5405         0.0.0.0:*        9133/corosync
>> unix  2    [ ACC ]  STREAM  LISTENING  148888  9133/corosync  @quorum
>> unix  2    [ ACC ]  STREAM  LISTENING  148884  9133/corosync  @cmap
>> unix  2    [ ACC ]  STREAM  LISTENING  148887  9133/corosync  @votequorum
>> unix  2    [ ACC ]  STREAM  LISTENING  148885  9133/corosync  @cfg
>> unix  2    [ ACC ]  STREAM  LISTENING  148886  9133/corosync  @cpg
>> unix  2    [     ]  DGRAM              148840  9133/corosync
>> 7. ./pacemakerd -f & gives the following error and exits:
>> [root@node_cu pacemaker]# pacemakerd -f
>> cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 1s
>> cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 2s
>> cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 3s
>> cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 4s
>> cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 5s
>> Could not connect to Cluster Configuration Database API, error 6
>>
>> Can you please point out what is missing in these steps?
>>
>> Before trying these steps, I tried running "pcs cluster start", but that
>> command fails because the "service" script is not found, as the root
>> filesystem contains neither /etc/init.d/ nor /sbin/service.
>>
>> So the plan is to bring up corosync and pacemaker manually, and later do
>> the cluster configuration using "pcs" commands.
>>
>> Regards,
>> Sriram
>>
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org