Re: [ClusterLabs] corosync/pacemaker on ~100 nodes cluster

2016-08-23 Thread Ken Gaillot
On 08/23/2016 11:46 AM, Klaus Wenninger wrote:
> On 08/23/2016 06:26 PM, Radoslaw Garbacz wrote:
>> Hi,
>>
>> I would like to ask for settings (and hardware requirements) to have
>> corosync/pacemaker running on a cluster of about 100 nodes.
> Actually I had thought that 16 would be the limit for full
> pacemaker-cluster-nodes.
> For larger deployments pacemaker-remote should be the way to go. Were
> you speaking of a cluster with remote-nodes?
> 
> Regards,
> Klaus
>>
>> For now some nodes get totally frozen (high CPU, high network usage),
>> so that even login is not possible. By manipulating
>> corosync/pacemaker/kernel parameters I managed to run it on a ~40-node
>> cluster, but I am not sure which parameters are critical, how to make
>> it more responsive, or how to make the number of nodes even bigger.

16 is a practical limit without special hardware and tuning, so that's
often what companies that offer support for clusters will accept.

I know people have gone well higher than 16 with a lot of optimization,
but I think somewhere between 32 and 64 corosync can't keep up with the
messages. Your 40 nodes sounds about right. I'd be curious to hear what
you had to do (with hardware, OS tuning, and corosync tuning) to get
that far.
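For reference, the corosync tuning involved here usually lives in the totem section of corosync.conf; a hedged sketch of the kind of knobs involved (the values below are illustrative assumptions for a large cluster, not tested recommendations):

```
totem {
    version: 2
    # Token timeout in ms; larger clusters usually need far more than the default
    token: 10000
    # How many token retransmits before the token is declared lost
    token_retransmits_before_loss_const: 10
    # Time to wait for join messages, in ms
    join: 100
    # Must be larger than token (default is 1.2 * token)
    consensus: 12000
}
```

See corosync.conf(5) for the exact semantics and defaults of each option.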

As Klaus mentioned, Pacemaker Remote is the preferred way to go beyond
that currently:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html
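For context, a Pacemaker Remote node is driven by an ordinary resource of the ocf:pacemaker:remote agent; a minimal CIB sketch (the id and server address are made up for illustration):

```xml
<primitive id="remote1" class="ocf" provider="pacemaker" type="remote">
  <instance_attributes id="remote1-instance_attributes">
    <nvpair id="remote1-instance_attributes-server"
            name="server" value="192.168.122.101"/>
  </instance_attributes>
  <operations>
    <op id="remote1-monitor-30s" name="monitor" interval="30s"/>
  </operations>
</primitive>
```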

>> Thanks,
>>
>> -- 
>> Best Regards,
>>
>> Radoslaw Garbacz
>> XtremeData Incorporation

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Solved: pacemakerd quits after few seconds with some errors

2016-08-23 Thread Gabriele Bulfon
Found the 2 reasons:
1) I had to use gcc 4.8 for libqb to use internal memory barriers.
This still did not solve the crash, but changed the way it crashed the subdaemons.
2) /usr/var/run is not writable to everyone, but the pacemakerd subdaemons want to 
create socket files there as the hacluster user, and fail!
I will see if I can create these files in advance with the correct permissions 
during installation, but: how can I change this directory? It looks like it's 
libqb, but how can I drive this folder from the daemons? That way I could 
create a /usr/var/run/cluster with the right permissions and let everything run there.
Thanks
Gabriele
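A sketch of the pre-creation idea, assuming the /usr/var/run/cluster path from above and the usual hacluster/haclient owner and group (a packaging-time step under those assumptions, not a tested fix):

```shell
#!/bin/sh
# Pre-create the IPC socket directory so subdaemons running as hacluster
# can create their socket files there. Path and ownership are assumptions
# taken from the message above; adjust to the actual layout.
RUNDIR=/usr/var/run/cluster
install -d -m 0775 "$RUNDIR"
chown hacluster:haclient "$RUNDIR"
```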

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
From:
Gabriele Bulfon
To:
Cluster Labs - All topics related to open-source clustering welcomed
kwenn...@redhat.com
Date:
23 August 2016 15.05.32 CEST
Subject:
Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
I found that pacemakerd leaves a core file where I launch it, and here is the 
output from "mdb core":
sonicle@xstorage1:/sonicle/etc/cluster/corosync# mdb core
Loading modules: [ libc.so.1 ld.so.1 ]
$C
08047a48 libqb.so.0.18.0`qb_thread_lock+0x16(0, feef9875, 8047a9c, fe9eb842, 
fe9ff000, 806fc78)
08047a68 libqb.so.0.18.0`qb_atomic_int_add+0x22(806fd84, 1, 8047a9c, 773)
08047a88 libqb.so.0.18.0`qb_ipcs_ref+0x23(806fc78, fea30960, feef9865, 
fe9de139, fede608f, 806fb58)
08047ab8 libqb.so.0.18.0`qb_ipcs_create+0x68(8057fd9, 0, 0, 8069470, 805302e, 
20)
08047ae8 libcrmcommon.so.3.5.0`mainloop_add_ipc_server+0x77(8057fd9, 0, 
8069470, 8047b64, 0, feffb0a8)
08047b28 main+0x18e(8047b1c, fef726a8, 8047b58, 8052d2f, 1, 8047b64)
08047b58 _start+0x83(1, 8047c70, 0, 8047c8c, 8047ca0, 8047cb4)

From:
Gabriele Bulfon
To:
kwenn...@redhat.com Cluster Labs - All topics related to open-source clustering 
welcomed
Date:
23 August 2016 14.30.20 CEST
Subject:
Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
About the hacluster/haclient user/group, I start to think that cib can't 
connect because it's started by pacemakerd with user hacluster, even though 
pacemakerd is started as root.
Instead, just before, pacemakerd is able to connect with the same call, but that 
is as the root user.
So I tried to run pacemakerd as hacluster, and in fact it can't start that way.
I then tried to add the uidgid spec in corosync.conf, but it seems not to work 
anyway.
So... should I also start corosync as hacluster? Is it safe to run everything 
as root? How can I force pacemakerd to run every child as root?
...if this is the problem...

--
From: Klaus Wenninger
To: users@clusterlabs.org
Date: 23 August 2016 9.07.03 CEST
Subject: Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
On 08/23/2016 08:50 AM, Gabriele Bulfon wrote:
Ok, looks like Corosync now runs fine with its version, but then
pacemakerd fails again with new errors on attrd and other daemons it
tries to fork.
The main reason seems to be around HA signon and the cluster process group API.
Any idea?
Just to be sure: You recompiled pacemaker against your new corosync?
Klaus
Gabriele

--
From: Jan Pokorný
To: users@clusterlabs.org
Date: 23 August 2016 7.59.37 CEST
Subject: Re: [ClusterLabs] pacemakerd quits after few seconds with
some errors
On 23/08/16 07:23 +0200, Gabriele Bulfon wrote:
Thanks! I am using Corosync 2.3.6 and Pacemaker 1.1.4 using the
"--with-corosync".
How does Corosync look for its own version?
The situation may be as easy as building corosync from GitHub-provided
automatic tarball, which is never a good idea if upstream has its own
way of proper release delivery:
http://build.clusterlabs.org/corosync/releases/
(specific URLs are also being part of the corosync announcements
on this list)
The issue with automatic tarballs has already been reported:
https://github.com/corosync/corosync/issues/116
--
Jan (Poki)

Re: [ClusterLabs] corosync/pacemaker on ~100 nodes cluster

2016-08-23 Thread Klaus Wenninger
On 08/23/2016 06:26 PM, Radoslaw Garbacz wrote:
> Hi,
>
> I would like to ask for settings (and hardware requirements) to have
> corosync/pacemaker running on a cluster of about 100 nodes.
Actually I had thought that 16 would be the limit for full
pacemaker-cluster-nodes.
For larger deployments pacemaker-remote should be the way to go. Were
you speaking of a cluster with remote-nodes?

Regards,
Klaus
>
> For now some nodes get totally frozen (high CPU, high network usage),
> so that even login is not possible. By manipulating
> corosync/pacemaker/kernel parameters I managed to run it on a ~40-node
> cluster, but I am not sure which parameters are critical, how to make
> it more responsive, or how to make the number of nodes even bigger.
>
> Thanks,
>
> -- 
> Best Regards,
>
> Radoslaw Garbacz
> XtremeData Incorporation




Re: [ClusterLabs] Entire Group stop on stopping of single Resource (Jan Pokorn?)

2016-08-23 Thread jaspal singla
Thanks,
Jaspal Singla

On Mon, Aug 22, 2016 at 7:42 PM,  wrote:

> Message: 1
> Date: Mon, 22 Aug 2016 14:24:28 +0200
> From: Attila Megyeri 
> To: Cluster Labs - All topics related to open-source clustering
> welcomed
> Subject: Re: [ClusterLabs] Mysql slave did not start replication after
> failure, and read-only IP also remained active on the much outdated
> slave
>
> Hi Andrei,
>
> I waited several hours, and nothing happened.
>
> I assume that the RA does not treat this case properly. Mysql was running,
> but the "show slave status" command returned something that the RA was not
> prepared to parse, and instead of reporting a non-readable attribute, it
> returned some generic error that did not stop the server.
>
> Rgds,
> Attila
>
>
> -Original Message-
> From: Andrei Borzenkov [mailto:arvidj...@gmail.com]
> Sent: Monday, August 22, 2016 11:42 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed <
> users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Mysql slave did not start replication after
> failure, and read-only IP also remained active on the much outdated slave
>
> On Mon, Aug 22, 2016 at 12:18 PM, Attila Megyeri
>  wrote:
> > Dear community,
> >
> >
> >
> > A few days ago we had an issue in our Mysql M/S replication cluster.
> >
> > We have a one R/W Master, and a one RO Slave setup. RO VIP is supposed
> to be
> > running on the slave if it is not too much behind the master, and if any
> > error occurs, RO VIP is moved to the master.
> >
> >
> >
> > Something happened with the slave Mysql (some disk issue, still
> > investigating), but the problem is, that the slave VIP remained on the
> slave
> > device, even though the slave process was not running, and the server was
> > much outdated.
> >
> >
> >
> > During the issue the following log entries appeared (just an extract as
> it
> > would be too long):
> >
> >
> >
> >
> >
> > Aug 20 02:04:07 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was
> > not scheduled for 14088.5488 ms (threshold is 4000.0000 ms). Consider token
> > timeout increase.
> >
> > Aug 20 02:04:07 ctdb1 corosync[1056]:   [TOTEM ] A processor failed,
> forming
> > new configuration.
> >
> > Aug 20 02:04:34 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was
> > not scheduled for 27065.2559 ms (threshold is 4000.0000 ms). Consider token
> > timeout increase.
> >
> > Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership
> (xxx:6720)
> > was formed. Members left: 168362243 168362281 168362282 168362301
> 168362302
> > 168362311 168362312 1
> >
> > Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership
> (xxx:6724)
> > was formed. Members
> >
> > ..
> >
> > Aug 20 02:13:28 ctdb1 corosync[1056]:   [MAIN  ] Completed service
> > synchronization, ready to provide service.
> >
> > ..
> >
> > Aug 20 02:13:29 ctdb1 attrd[1584]:   notice: attrd_trigger_update:
> Sending
> > flush op to all hosts for: readable (1)
> >
> > ?
> >
> > Aug 20 02:13:32 ctdb1 mysql(db-mysql)[10492]: INFO: post-demote
> notification
> > for ctdb1
> >
> > Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-master)[10490]: INFO: IP status = ok,
> > IP_CIP=
> >
> > Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM
> operation
> > db-ip-master_stop_0 (call=371, rc=0, cib-update=179, confirmed=true) ok
> >
> > Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Adding inet
> address
> > xxx/24 with broadcast address  to device eth0
> >
> > Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Bringing device
> > eth0 up
> >
> > Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO:
> > /usr/lib/heartbeat/send_arp -i 200 -r 5 -p
> > 

Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-23 Thread Gabriele Bulfon
I found that pacemakerd leaves a core file where I launch it, and here is the 
output from "mdb core":
sonicle@xstorage1:/sonicle/etc/cluster/corosync# mdb core
Loading modules: [ libc.so.1 ld.so.1 ]
$C
08047a48 libqb.so.0.18.0`qb_thread_lock+0x16(0, feef9875, 8047a9c, fe9eb842, 
fe9ff000, 806fc78)
08047a68 libqb.so.0.18.0`qb_atomic_int_add+0x22(806fd84, 1, 8047a9c, 773)
08047a88 libqb.so.0.18.0`qb_ipcs_ref+0x23(806fc78, fea30960, feef9865, 
fe9de139, fede608f, 806fb58)
08047ab8 libqb.so.0.18.0`qb_ipcs_create+0x68(8057fd9, 0, 0, 8069470, 805302e, 
20)
08047ae8 libcrmcommon.so.3.5.0`mainloop_add_ipc_server+0x77(8057fd9, 0, 
8069470, 8047b64, 0, feffb0a8)
08047b28 main+0x18e(8047b1c, fef726a8, 8047b58, 8052d2f, 1, 8047b64)
08047b58 _start+0x83(1, 8047c70, 0, 8047c8c, 8047ca0, 8047cb4)


Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-23 Thread Gabriele Bulfon
About the hacluster/haclient user/group, I start to think that cib can't 
connect because it's started by pacemakerd with user hacluster, even though 
pacemakerd is started as root.
Instead, just before, pacemakerd is able to connect with the same call, but that 
is as the root user.
So I tried to run pacemakerd as hacluster, and in fact it can't start that way.
I then tried to add the uidgid spec in corosync.conf, but it seems not to work 
anyway.
So... should I also start corosync as hacluster? Is it safe to run everything 
as root? How can I force pacemakerd to run every child as root?
...if this is the problem...
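For reference, the uidgid spec mentioned above is written like this in corosync.conf (standard corosync.conf(5) syntax; whether it addresses this particular failure is exactly the open question here):

```
uidgid {
    uid: hacluster
    gid: haclient
}
```

Note that this only grants the named uid/gid access to corosync's IPC; it does not change which user pacemakerd spawns its child daemons as.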



Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-23 Thread Gabriele Bulfon
Sure I did: I created the new corosync package and installed it on the dev machine 
before building and creating the new pacemaker package on the dev machine.



Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-23 Thread Gabriele Bulfon
Ok, looks like Corosync now runs fine with its version, but then pacemakerd 
fails again with new errors on attrd and other daemons it tries to fork.
The main reason seems to be around HA signon and the cluster process group API.
Any idea?
Gabriele



Re: [ClusterLabs] corosync.log is 5.1GB in a short period

2016-08-23 Thread Klaus Wenninger
On 08/23/2016 08:31 AM, Kristoffer Grönlund wrote:
> 朱荣  writes:
>
>> Hello:
>> I have a problem with the corosync log: it increased to 5.1GB in a short time.
>> When I check the corosync log, it shows me the same message repeated over a
>> short period, like in the attachment.
>> What happened to corosync? Thank you!
>> My corosync and pacemaker versions are corosync-2.3.4-7.el7.x86_64 and
>> pacemaker-1.1.13-10.el7.x86_64.
>>   by zhu rong
> Something is leaking file descriptors in libqb, but the information
> provided does not yield any further hints.

I remember having had an fd leak in libqb with pacemaker_remote.
Updating libqb did the trick. Maybe it is something generic.
But I would guess that something else is wrong as well: in my
case it was an error case leaking fds over time (pacemaker_remote
not connected by a cluster node).
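A quick way to confirm such a leak is to track the fd count of the suspect process over time; a minimal sketch (the PID argument and default are illustrative):

```shell
#!/bin/sh
# Print the number of open file descriptors for a given PID by counting
# entries in /proc/<pid>/fd. Defaults to this shell's own PID so the
# script is runnable standalone; pass the corosync/pacemaker_remote PID
# and run it periodically to see whether the count keeps growing.
pid=${1:-$$}
count=$(ls "/proc/$pid/fd" | wc -l)
echo "pid $pid: $count open fds"
```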

>
> Check the earlier logs to see if you can find any indication as to what
> may be causing the file descriptor leak.
>
> Use crm_report to collect log data and other information from your
> cluster nodes.
>
> Finally, attach log file excerpts as text attachments, not images.
>
> Cheers,
> Kristoffer
>




Re: [ClusterLabs] corosync.log is 5.1GB in a short period

2016-08-23 Thread Kristoffer Grönlund
朱荣  writes:

> Hello:
> I have a problem with the corosync log: it increased to 5.1GB in a short time.
> When I check the corosync log, it shows me the same message repeated over a
> short period, like in the attachment.
> What happened to corosync? Thank you!
> My corosync and pacemaker versions are corosync-2.3.4-7.el7.x86_64 and
> pacemaker-1.1.13-10.el7.x86_64.
>   by zhu rong

Something is leaking file descriptors in libqb, but the information
provided does not yield any further hints.

Check the earlier logs to see if you can find any indication as to what
may be causing the file descriptor leak.

Use crm_report to collect log data and other information from your
cluster nodes.

Finally, attach log file excerpts as text attachments, not images.

Cheers,
Kristoffer

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-23 Thread Jan Pokorný
On 23/08/16 07:23 +0200, Gabriele Bulfon wrote:
> Thanks! I am using Corosync 2.3.6 and Pacemaker 1.1.4 using the 
> "--with-corosync".
> How does Corosync look for its own version?

The situation may be as easy as building corosync from GitHub-provided
automatic tarball, which is never a good idea if upstream has its own
way of proper release delivery:
http://build.clusterlabs.org/corosync/releases/
(specific URLs are also being part of the corosync announcements
on this list)

The issue with automatic tarballs has already been reported:
https://github.com/corosync/corosync/issues/116
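The release-tarball route sketched as shell, with the caveat that the exact version and filename below are assumptions; take the real ones from the releases page or the announcement mails on this list:

```shell
#!/bin/sh
# Build corosync from an official release tarball instead of a GitHub
# auto-generated snapshot (snapshots lack the generated version info,
# per the issue referenced above). VER is an assumed example.
VER=2.3.6
curl -LO "http://build.clusterlabs.org/corosync/releases/corosync-$VER.tar.gz"
tar xzf "corosync-$VER.tar.gz"
cd "corosync-$VER"
./configure && make && make install
```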

-- 
Jan (Poki)

