Re: [ClusterLabs] Corosync with passive rrp, udpu - Unable to reset after "Marking ringid 1 interface 127.0.0.1 FAULTY"

2016-06-16 Thread Martin Schlegel
Hi Jan

Thanks for your super-quick response!

We do not use a Network Manager - it's all static on these Ubuntu 14.04 nodes
(/etc/network/interfaces). 

I do not think we ran ifdown on a network interface manually. However, the
IP addresses are assigned to bond0 and bond1: we use four physical network
interfaces, with two bonded into a public network (bond1) and two bonded
into a private network (bond0).

Could this have anything to do with it?

Regards,
Martin Schlegel

___

From /etc/network/interfaces:

auto bond0
iface bond0 inet static
#pre-up /sbin/ethtool -s bond0 speed 1000 duplex full autoneg on
post-up ifenslave bond0 eth0 eth2
pre-down ifenslave -d bond0 eth0 eth2
bond-slaves none
bond-mode 4
bond-lacp-rate fast
bond-miimon 100
bond-downdelay 0
bond-updelay 0
bond-xmit_hash_policy 1
address  [...]
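When a ring goes FAULTY it can help to rule out the bonding layer first. A minimal sketch, assuming the standard Linux /proc/net/bonding status files (bond0, eth0 and eth2 are the names from the config above):

```shell
# Print the link-state lines from a Linux bonding status file.
# Feed it /proc/net/bonding/bond0 (or bond1) on a live system.
bond_mii_status() {
    grep -E '^(MII Status|Slave Interface):'
}

# Typical use on a node (requires the bonding driver):
#   bond_mii_status < /proc/net/bonding/bond0
```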

> Jan Friesse wrote on 16 June 2016 at 17:55:
> 
> Martin Schlegel wrote:
> 
> > Hello everyone,
> > 
> > We have been running a 3-node Pacemaker (1.1.14) / Corosync (2.3.5) cluster
> > successfully for a couple of months, and we have started seeing a faulty
> > ring with an unexpected 127.0.0.1 binding that we cannot reset via
> > "corosync-cfgtool -r".
> 
> This is a problem. A bind to 127.0.0.1 means an ifdown happened, and with
> RRP that means a BIG problem.
> 
> > We have had this once before, and only restarting Corosync (and everything
> > else) on the node showing the unexpected 127.0.0.1 binding made the problem
> > go away. However, in production we obviously would like to avoid this if
> > possible.
> 
> Just don't do ifdown. Never. If you are using NetworkManager (which does 
> ifdown by default if the cable is disconnected), use something like the 
> NetworkManager-config-server package (it's just a configuration change, so 
> you can adapt it to whatever distribution you are using).
> 
> Regards,
>  Honza
> 
> > So from the following description, how can I troubleshoot this issue,
> > and/or does anybody have a good idea what might be happening here?
> > 
> > We run 2x passive RRP rings across different IP subnets via udpu and we
> > get the following output (all IPs obfuscated). Please notice the
> > unexpected interface binding 127.0.0.1 for host pg2.
> > 
> > If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1
> > briefly shows "no faults" but goes back to "FAULTY" seconds later.
> > 
> > Regards,
> > Martin Schlegel
> > _
> > 
> > root@pg1:~# corosync-cfgtool -s
> > Printing ring status.
> > Local node ID 1
> > RING ID 0
> >  id = A.B.C1.5
> >  status = ring 0 active with no faults
> > RING ID 1
> >  id = D.E.F1.170
> >  status = Marking ringid 1 interface D.E.F1.170 FAULTY
> > 
> > root@pg2:~# corosync-cfgtool -s
> > Printing ring status.
> > Local node ID 2
> > RING ID 0
> >  id = A.B.C2.88
> >  status = ring 0 active with no faults
> > RING ID 1
> >  id = 127.0.0.1
> >  status = Marking ringid 1 interface 127.0.0.1 FAULTY
> > 
> > root@pg3:~# corosync-cfgtool -s
> > Printing ring status.
> > Local node ID 3
> > RING ID 0
> >  id = A.B.C3.236
> >  status = ring 0 active with no faults
> > RING ID 1
> >  id = D.E.F3.112
> >  status = Marking ringid 1 interface D.E.F3.112 FAULTY
> > 
> > _
> > 
> > /etc/corosync/corosync.conf from pg1; other nodes use different subnets
> > and IPs, but are otherwise identical:
> > ===
> > quorum {
> >  provider: corosync_votequorum
> >  expected_votes: 3
> > }
> > 
> > totem {
> >  version: 2
> > 
> >  crypto_cipher: none
> >  crypto_hash: none
> > 
> >  rrp_mode: passive
> >  interface {
> >  ringnumber: 0
> >  bindnetaddr: A.B.C1.0
> >  mcastport: 5405
> >  ttl: 1
> >  }
> >  interface {
> >  ringnumber: 1
> >  bindnetaddr: D.E.F1.64
> >  mcastport: 5405
> >  ttl: 1
> >  }
> >  transport: udpu
> > }
> > 
> > nodelist {
> >  node {
> >  ring0_addr: pg1
> >  ring1_addr: pg1p
> >  nodeid: 1
> >  }
> >  node {
> >  ring0_addr: pg2
> >  ring1_addr: pg2p
> >  nodeid: 2
> >  }
> >  node {
> >  ring0_addr: pg3
> >  ring1_addr: pg3p
> >  nodeid: 3
> >  }
> > }
> > 
> > logging {
> >  to_syslog: yes
> > }
> > 
> > ===
> > 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync with passive rrp, udpu - Unable to reset after "Marking ringid 1 interface 127.0.0.1 FAULTY"

2016-06-16 Thread Jan Friesse

Martin Schlegel wrote:

Hello everyone,

We have been running a 3-node Pacemaker (1.1.14) / Corosync (2.3.5) cluster
successfully for a couple of months, and we have started seeing a faulty ring
with an unexpected 127.0.0.1 binding that we cannot reset via
"corosync-cfgtool -r".


This is a problem. A bind to 127.0.0.1 means an ifdown happened, and with
RRP that means a BIG problem.




We have had this once before, and only restarting Corosync (and everything
else) on the node showing the unexpected 127.0.0.1 binding made the problem
go away. However, in production we obviously would like to avoid this if
possible.


Just don't do ifdown. Never. If you are using NetworkManager (which does 
ifdown by default if the cable is disconnected), use something like the 
NetworkManager-config-server package (it's just a configuration change, so 
you can adapt it to whatever distribution you are using).


Regards,
  Honza



So from the following description, how can I troubleshoot this issue, and/or
does anybody have a good idea what might be happening here?

We run 2x passive RRP rings across different IP subnets via udpu and we get
the following output (all IPs obfuscated). Please notice the unexpected
interface binding 127.0.0.1 for host pg2.

If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1
briefly shows "no faults" but goes back to "FAULTY" seconds later.

Regards,
Martin Schlegel
_

root@pg1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
 id  = A.B.C1.5
 status  = ring 0 active with no faults
RING ID 1
 id  = D.E.F1.170
 status  = Marking ringid 1 interface D.E.F1.170 FAULTY

root@pg2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
 id  = A.B.C2.88
 status  = ring 0 active with no faults
RING ID 1
 id  = 127.0.0.1
 status  = Marking ringid 1 interface 127.0.0.1 FAULTY

root@pg3:~# corosync-cfgtool -s
Printing ring status.
Local node ID 3
RING ID 0
 id  = A.B.C3.236
 status  = ring 0 active with no faults
RING ID 1
 id  = D.E.F3.112
 status  = Marking ringid 1 interface D.E.F3.112 FAULTY


_


/etc/corosync/corosync.conf from pg1; other nodes use different subnets and
IPs, but are otherwise identical:
===
quorum {
 provider: corosync_votequorum
 expected_votes: 3
}

totem {
 version: 2

 crypto_cipher: none
 crypto_hash: none

 rrp_mode: passive
 interface {
 ringnumber: 0
 bindnetaddr: A.B.C1.0
 mcastport: 5405
 ttl: 1
 }
 interface {
 ringnumber: 1
 bindnetaddr: D.E.F1.64
 mcastport: 5405
 ttl: 1
 }
 transport: udpu
}

nodelist {
 node {
 ring0_addr: pg1
 ring1_addr: pg1p
 nodeid: 1
 }
 node {
 ring0_addr: pg2
 ring1_addr: pg2p
 nodeid: 2
 }
 node {
 ring0_addr: pg3
 ring1_addr: pg3p
 nodeid: 3
 }
}

logging {
 to_syslog: yes
}

===



[ClusterLabs] Corosync with passive rrp, udpu - Unable to reset after "Marking ringid 1 interface 127.0.0.1 FAULTY"

2016-06-16 Thread Martin Schlegel
Hello everyone,

We have been running a 3-node Pacemaker (1.1.14) / Corosync (2.3.5) cluster
successfully for a couple of months, and we have started seeing a faulty ring
with an unexpected 127.0.0.1 binding that we cannot reset via
"corosync-cfgtool -r".

We have had this once before and only restarting Corosync (and everything else)
on the node showing the unexpected 127.0.0.1 binding made the problem go away.
However, in production we obviously would like to avoid this if possible.

So from the following description, how can I troubleshoot this issue, and/or
does anybody have a good idea what might be happening here?

We run 2x passive RRP rings across different IP subnets via udpu and we get
the following output (all IPs obfuscated). Please notice the unexpected
interface binding 127.0.0.1 for host pg2.

If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1
briefly shows "no faults" but goes back to "FAULTY" seconds later.

Regards,
Martin Schlegel
_

root@pg1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id  = A.B.C1.5
status  = ring 0 active with no faults
RING ID 1
id  = D.E.F1.170
status  = Marking ringid 1 interface D.E.F1.170 FAULTY

root@pg2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id  = A.B.C2.88
status  = ring 0 active with no faults
RING ID 1
id  = 127.0.0.1
status  = Marking ringid 1 interface 127.0.0.1 FAULTY

root@pg3:~# corosync-cfgtool -s
Printing ring status.
Local node ID 3
RING ID 0
id  = A.B.C3.236
status  = ring 0 active with no faults
RING ID 1
id  = D.E.F3.112
status  = Marking ringid 1 interface D.E.F3.112 FAULTY


_
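As an aside, the symptom shown above is easy to detect mechanically. A minimal sketch that flags a ring bound to 127.0.0.1, parsing the corosync 2.x "corosync-cfgtool -s" output format seen in this thread:

```shell
# Succeeds (exit 0) if any ring id in the input is bound to the loopback
# address, as in the pg2 output above.
ring_bound_to_loopback() {
    grep -Eq 'id[[:space:]]*=[[:space:]]*127\.0\.0\.1'
}

# Typical use in a monitoring check:
#   if corosync-cfgtool -s | ring_bound_to_loopback; then
#       echo "WARNING: a ring is bound to 127.0.0.1 on this node"
#   fi
```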


/etc/corosync/corosync.conf from pg1; other nodes use different subnets and
IPs, but are otherwise identical:
===
quorum {
provider: corosync_votequorum
expected_votes: 3
}

totem {
version: 2

crypto_cipher: none
crypto_hash: none

rrp_mode: passive
interface {
ringnumber: 0
bindnetaddr: A.B.C1.0
mcastport: 5405
ttl: 1
}
interface {
ringnumber: 1
bindnetaddr: D.E.F1.64
mcastport: 5405
ttl: 1
}
transport: udpu
}

nodelist {
node {
ring0_addr: pg1
ring1_addr: pg1p
nodeid: 1
}
node {
ring0_addr: pg2
ring1_addr: pg2p
nodeid: 2
}
node {
ring0_addr: pg3
ring1_addr: pg3p
nodeid: 3
}
}

logging {
to_syslog: yes
}

===



Re: [ClusterLabs] Corosync 2.3.6 is available at corosync.org!

2016-06-16 Thread Christine Caulfield
On 16/06/16 14:09, Vladislav Bogdanov wrote:
> 16.06.2016 16:04, Christine Caulfield wrote:
>> On 16/06/16 13:54, Vladislav Bogdanov wrote:
>>> 16.06.2016 15:28, Christine Caulfield wrote:
 On 16/06/16 13:22, Vladislav Bogdanov wrote:
> Hi,
>
> 16.06.2016 14:09, Jan Friesse wrote:
>> I am pleased to announce the latest maintenance release of Corosync
>> 2.3.6 available immediately from our website at
>> http://build.clusterlabs.org/corosync/releases/.
> [...]
>> Christine Caulfield (9):
> [...]
>>  Add some more RO keys
>
> Is there a strong reason to make quorum.wait_for_all read-only?
>

 It's almost a no-op for documentation purposes. corosync has never
 looked at that value after startup anyway. This just makes sure that an
 error will be returned if an attempt is made to change it.
>>>
>>> But it looks at it on a config reload, allowing one to change
>>> wait_for_all_status from 0 to 1, but not vice versa. And reload does not
>>> look at "ro" - I thought it did. That's fine.
>>> IIUC, even after this change I still have everything working as expected
>>> (I actually did not look at that part of code before):
>>>
>>> Setting wait_for_all to 0 and two_node to 1 in config (both were not set
>>> at all prior to that) and then reload leaves wait_for_all_status=0 and
>>> NODE_FLAGS_WFASTATUS bit unset in flags. But setting wait_for_all to 1
>>> after that (followed by another reload) sets wait_for_all_status=1 and
>>> NODE_FLAGS_WFASTATUS bit.
>>
>> Interesting. I'm not sure that's intended, but it sounds safe :) I'll
>> look into it though - if only for my own curiosity.
> 
> Please do not fix^H^H^Hbreak that!

I will be careful :)

Chrissie

> But some inline documentation about the current behavior is worth adding ;)
> 
> 
>>
>>>
>>> Great, thank you!




Re: [ClusterLabs] Corosync 2.3.6 is available at corosync.org!

2016-06-16 Thread Vladislav Bogdanov

16.06.2016 16:04, Christine Caulfield wrote:

On 16/06/16 13:54, Vladislav Bogdanov wrote:

16.06.2016 15:28, Christine Caulfield wrote:

On 16/06/16 13:22, Vladislav Bogdanov wrote:

Hi,

16.06.2016 14:09, Jan Friesse wrote:

I am pleased to announce the latest maintenance release of Corosync
2.3.6 available immediately from our website at
http://build.clusterlabs.org/corosync/releases/.

[...]

Christine Caulfield (9):

[...]

 Add some more RO keys


Is there a strong reason to make quorum.wait_for_all read-only?



It's almost a no-op for documentation purposes. corosync has never
looked at that value after startup anyway. This just makes sure that an
error will be returned if an attempt is made to change it.


But it looks at it on a config reload, allowing one to change
wait_for_all_status from 0 to 1, but not vice versa. And reload does not
look at "ro" - I thought it did. That's fine.
IIUC, even after this change I still have everything working as expected
(I actually did not look at that part of code before):

Setting wait_for_all to 0 and two_node to 1 in config (both were not set
at all prior to that) and then reload leaves wait_for_all_status=0 and
NODE_FLAGS_WFASTATUS bit unset in flags. But setting wait_for_all to 1
after that (followed by another reload) sets wait_for_all_status=1 and
NODE_FLAGS_WFASTATUS bit.


Interesting. I'm not sure that's intended, but it sounds safe :) I'll
look into it though - if only for my own curiosity.


Please do not fix^H^H^Hbreak that!
But some inline documentation about the current behavior is worth adding ;)






Great, thank you!





Re: [ClusterLabs] Corosync 2.3.6 is available at corosync.org!

2016-06-16 Thread Christine Caulfield
On 16/06/16 13:54, Vladislav Bogdanov wrote:
> 16.06.2016 15:28, Christine Caulfield wrote:
>> On 16/06/16 13:22, Vladislav Bogdanov wrote:
>>> Hi,
>>>
>>> 16.06.2016 14:09, Jan Friesse wrote:
 I am pleased to announce the latest maintenance release of Corosync
 2.3.6 available immediately from our website at
 http://build.clusterlabs.org/corosync/releases/.
>>> [...]
 Christine Caulfield (9):
>>> [...]
 Add some more RO keys
>>>
>>> Is there a strong reason to make quorum.wait_for_all read-only?
>>>
>>
>> It's almost a no-op for documentation purposes. corosync has never
>> looked at that value after startup anyway. This just makes sure that an
>> error will be returned if an attempt is made to change it.
> 
> But it looks at it on a config reload, allowing one to change
> wait_for_all_status from 0 to 1, but not vice versa. And reload does not
> look at "ro" - I thought it did. That's fine.
> IIUC, even after this change I still have everything working as expected
> (I actually did not look at that part of code before):
> 
> Setting wait_for_all to 0 and two_node to 1 in config (both were not set
> at all prior to that) and then reload leaves wait_for_all_status=0 and
> NODE_FLAGS_WFASTATUS bit unset in flags. But setting wait_for_all to 1
> after that (followed by another reload) sets wait_for_all_status=1 and
> NODE_FLAGS_WFASTATUS bit.

Interesting. I'm not sure that's intended, but it sounds safe :) I'll
look into it though - if only for my own curiosity.

> 
> Great, thank you!




Re: [ClusterLabs] Corosync 2.3.6 is available at corosync.org!

2016-06-16 Thread Christine Caulfield
On 16/06/16 13:54, Vladislav Bogdanov wrote:
> 16.06.2016 15:28, Christine Caulfield wrote:
>> On 16/06/16 13:22, Vladislav Bogdanov wrote:
>>> Hi,
>>>
>>> 16.06.2016 14:09, Jan Friesse wrote:
 I am pleased to announce the latest maintenance release of Corosync
 2.3.6 available immediately from our website at
 http://build.clusterlabs.org/corosync/releases/.
>>> [...]
 Christine Caulfield (9):
>>> [...]
 Add some more RO keys
>>>
>>> Is there a strong reason to make quorum.wait_for_all read-only?
>>>
>>
>> It's almost a no-op for documentation purposes. corosync has never
>> looked at that value after startup anyway. This just makes sure that an
>> error will be returned if an attempt is made to change it.
> 
> But it looks at it on a config reload, allowing one to change
> wait_for_all_status from 0 to 1, but not vice versa. And reload does not
> look at "ro" - I thought it did. That's fine.
> IIUC, even after this change I still have everything working as expected
> (I actually did not look at that part of code before):
> 

It doesn't... or if it does, it's a bug! There should be no way to
change wait_for_all once a node is booted. Doing so threatens quorum.

Chrissie


> Setting wait_for_all to 0 and two_node to 1 in config (both were not set
> at all prior to that) and then reload leaves wait_for_all_status=0 and
> NODE_FLAGS_WFASTATUS bit unset in flags. But setting wait_for_all to 1
> after that (followed by another reload) sets wait_for_all_status=1 and
> NODE_FLAGS_WFASTATUS bit.
> 
> Great, thank you!
> 
> Vladislav
> 
>>
>> Chrissie
>>
>>> In one of our products I use the following (fully automated) steps to
>>> migrate from a one-node to a two-node setup:
>>>
>>> == mark second node "being joined"
>>> * set quorum.wait_for_all to 0 to keep the cluster functioning if the
>>> node is rebooted or power is lost
>>> * set quorum.two_node to 1
>>> * Add second node to corosync.conf
>>> * reload corosync on a first node
>>> * configure fencing in pacemaker (for both nodes)
>>> * copy corosync.{key,conf} to a second node
>>> * enable/start corosync on the second node
>>> * set quorum.wait_for_all to 1
>>> * copy corosync.conf again to a second node
>>> * reload corosync on both nodes
>>> == Only at this point mark second node "joined"
>>> * enable/start pacemaker on a second node
>>>
>>> I realize this is all a little paranoid, but it is handy when you want
>>> to guard against any problem you are not yet aware of.
>>>
>>> Best regards,
>>> Vladislav


Re: [ClusterLabs] Corosync 2.3.6 is available at corosync.org!

2016-06-16 Thread Vladislav Bogdanov

16.06.2016 15:28, Christine Caulfield wrote:

On 16/06/16 13:22, Vladislav Bogdanov wrote:

Hi,

16.06.2016 14:09, Jan Friesse wrote:

I am pleased to announce the latest maintenance release of Corosync
2.3.6 available immediately from our website at
http://build.clusterlabs.org/corosync/releases/.

[...]

Christine Caulfield (9):

[...]

Add some more RO keys


Is there a strong reason to make quorum.wait_for_all read-only?



It's almost a no-op for documentation purposes. corosync has never
looked at that value after startup anyway. This just makes sure that an
error will be returned if an attempt is made to change it.


But it looks at it on a config reload, allowing one to change
wait_for_all_status from 0 to 1, but not vice versa. And reload does not
look at "ro" - I thought it did. That's fine.
IIUC, even after this change I still have everything working as expected 
(I actually did not look at that part of code before):


Setting wait_for_all to 0 and two_node to 1 in config (both were not set 
at all prior to that) and then reload leaves wait_for_all_status=0 and 
NODE_FLAGS_WFASTATUS bit unset in flags. But setting wait_for_all to 1 
after that (followed by another reload) sets wait_for_all_status=1 and 
NODE_FLAGS_WFASTATUS bit.


Great, thank you!

Vladislav



Chrissie


In one of our products I use the following (fully automated) steps to
migrate from a one-node to a two-node setup:

== mark second node "being joined"
* set quorum.wait_for_all to 0 to keep the cluster functioning if the node
is rebooted or power is lost
* set quorum.two_node to 1
* Add second node to corosync.conf
* reload corosync on a first node
* configure fencing in pacemaker (for both nodes)
* copy corosync.{key,conf} to a second node
* enable/start corosync on the second node
* set quorum.wait_for_all to 1
* copy corosync.conf again to a second node
* reload corosync on both nodes
== Only at this point mark second node "joined"
* enable/start pacemaker on a second node

I realize this is all a little paranoid, but it is handy when you want to
guard against any problem you are not yet aware of.

Best regards,
Vladislav




Re: [ClusterLabs] Alert notes

2016-06-16 Thread Ferenc Wágner
Klaus Wenninger  writes:

> On 06/16/2016 11:05 AM, Ferenc Wágner wrote:
>
>> Klaus Wenninger  writes:
>>
>>> On 06/15/2016 06:11 PM, Ferenc Wágner wrote:
>>>
 I think the default timestamp should contain date and time zone
 specification to make it unambiguous.
>>>
>>> Idea was to have a trade-off between length and amount of information.
>>
>> I don't think it's worth saving a couple of bytes by dropping this
>> information.  In many cases there will be some way to recover it (from
>> SMTP headers or system logs), but that complicates things.
>
> Wasn't about saving some bytes in the size of a file or so, but
> rather about keeping readability. If the timestamp fills your screen,
> you won't be able to read the actual information... have a look
> at /var/log/messages...
> The pure intention was to have a default that creates a kind of nice-looking
> output, together with the file example, to give people an impression
> of what they could do with the feature.

I see.  Incidentally, the file example is probably the one which would
profit most from having full timestamps.  And some locking.

>> In a similar vein, keeping the sequence number around would simplify
>> alert ordering and loss detection on the receiver side.  Especially with
>> SNMP, where the transport is unreliable as well.
>
> Nice idea... any OID in mind?

No.  But you can always extend PACEMAKER-MIB.

> Unfortunately the sequence number we have right now as an environment
> variable is not really fit for this purpose. It counts up with each
> and every alert being sent on a single node. So if you have multiple
> alerts configured, you will experience gaps that prevent you from
> using it for loss detection.

I see, it isn't per alert, unfortunately.  Still better than nothing,
though...
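Even with a counter shared across all configured alerts, a receiver can still run a first-pass gap check per sending node; a gap is then only a hint, not proof of loss. A minimal sketch, assuming the received sequence numbers have been extracted into a list:

```shell
# Report gaps in a stream of sequence numbers (one per line, arrival order).
find_seq_gaps() {
    awk 'NR > 1 && $1 != prev + 1 { printf "gap after %d (next was %d)\n", prev, $1 }
         { prev = $1 }'
}

# Example: the sequence 1 2 5 6 reports a gap after 2.
```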

 (BTW I'd prefer to run the alert scripts as a different user than the
 various Pacemaker components, but that would lead too far now.)
>>>
>>> well, something we thought about already and a point where the
>>> new feature breaks the ClusterMon-Interface.
>>> Unfortunately the impact is quite high - crmd has dropped privileges -
>>> but if the pain-level rises high enough ...
>>
>> There's very little room to do this.  You'd need to configure an alert
>> user and group, and store them in the saved uid/gid set before dropping
>> privileges for the crmd process.  Or use a separate daemon for sending
>> alerts, which feels cleaner.
>
> Yes 2nd daemon was the idea. We don't want to give more rights
> to crmd than it needs. Btw. the daemon is there already: lrmd ;-)

It's running as root already, so at least no problem changing to any
user.  And the default could be hacluster.

>> You are right.  The snmptrap tool does the string->binary conversion if
>> it gets the correct format.  Otherwise, if the length matches, it does a
>> plain cast to binary, interpreting for example 12:34:56.78 as
>> 12594-58-51,52:58:53.54,.55:56.  Looks like the sample SNMP alert agent
>> shouldn't let the user choose any timestamp format but
>> %Y-%m-%d,%H:%M:%S.%1N,%:z; unfortunately there's no way to enforce this
>> in the current design. 
>
> Well, generic vs. failsafe  ;-)
> Of course one could introduce something like the metadata in RAs
> to achieve things like that, but we wanted to keep things simple...
> After all, the scripts are just examples... and the timestamp format
> that should work is given in the header of each script...

More emphasis would help, I think.

>> Maybe it would be more appropriate to get the timestamp from crmd as
>> a high resolution (fractional) epoch all the time, and do the string
>> conversion in the agents as necessary.  One could still control the
>> format via instance_attributes where allowed.  Or keep around the
>> current mechanism as well to reduce code duplication in the agents.
>> Just some ideas...
>
> epoch was actually my first default ...
> additional epoch might be interesting alternative...

It would be useful.  Actually, crm_time_format_hr() currently fails for
any format string ending with any %-escape but N.  For example, "%Yx" is
formatted as "2016x", but "%Y" returns NULL.  You can avoid fixing this
by providing a fractional epoch instead. :)
-- 
Regards,
Feri



Re: [ClusterLabs] Corosync 2.3.6 is available at corosync.org!

2016-06-16 Thread Vladislav Bogdanov

Hi,

16.06.2016 14:09, Jan Friesse wrote:

I am pleased to announce the latest maintenance release of Corosync
2.3.6 available immediately from our website at
http://build.clusterlabs.org/corosync/releases/.

[...]

Christine Caulfield (9):

[...]

   Add some more RO keys


Is there a strong reason to make quorum.wait_for_all read-only?

In one of our products I use the following (fully automated) steps to
migrate from a one-node to a two-node setup:


== mark second node "being joined"
* set quorum.wait_for_all to 0 to keep the cluster functioning if the node
is rebooted or power is lost

* set quorum.two_node to 1
* Add second node to corosync.conf
* reload corosync on a first node
* configure fencing in pacemaker (for both nodes)
* copy corosync.{key,conf} to a second node
* enable/start corosync on the second node
* set quorum.wait_for_all to 1
* copy corosync.conf again to a second node
* reload corosync on both nodes
== Only at this point mark second node "joined"
* enable/start pacemaker on a second node

I realize this is all a little paranoid, but it is handy when you want to
guard against any problem you are not yet aware of.
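The steps above could be sketched as a script. This is a dry run that only echoes each step: set-corosync-key and add-node-to-corosync-conf are hypothetical placeholders for however you edit corosync.conf, node2 is a placeholder hostname, the /etc/corosync file names assume the usual layout, and whether "corosync-cfgtool -R" reloads one node or all reachable nodes should be checked against your corosync version.

```shell
# Dry-run sketch of the one-node -> two-node migration steps above.
run() { echo "+ $*"; }   # print each command instead of executing it

migrate_to_two_node() {
    # == mark second node "being joined"
    run set-corosync-key quorum.wait_for_all 0
    run set-corosync-key quorum.two_node 1
    run add-node-to-corosync-conf node2
    run corosync-cfgtool -R                      # reload on the first node
    run configure-pacemaker-fencing node1 node2
    run scp /etc/corosync/authkey /etc/corosync/corosync.conf node2:/etc/corosync/
    run ssh node2 systemctl enable --now corosync
    run set-corosync-key quorum.wait_for_all 1
    run scp /etc/corosync/corosync.conf node2:/etc/corosync/
    run corosync-cfgtool -R                      # reload on the first node again
    run ssh node2 corosync-cfgtool -R            # and on the second
    # == only at this point mark second node "joined"
    run ssh node2 systemctl enable --now pacemaker
}
```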


Best regards,
Vladislav




Re: [ClusterLabs] Alert notes

2016-06-16 Thread Ferenc Wágner
Klaus Wenninger  writes:

> On 06/15/2016 06:11 PM, Ferenc Wágner wrote:
>
>> Please find some random notes about my adventures testing the new alert
>> system.
>>
>> The first alert example in the documentation has no recipient:
>>
>> 
>>
>> In the example above, the cluster will call my-script.sh for each
>> event.
>>
>> while the next section starts as:
>>
>> Each alert may be configured with one or more recipients. The cluster
>> will call the agent separately for each recipient.
>
> The goal of the first example is to be as simple as possible.
> But of course it makes sense to mention that adding a recipient is not
> compulsory. And I guess it makes sense to point that out, as it is
> just ugly to think that you have to fake a recipient while it
> wouldn't make any sense in your context.

I agree.

>> I think the default timestamp should contain date and time zone
>> specification to make it unambiguous.
>
> Idea was to have a trade-off between length and amount of information.

I don't think it's worth saving a couple of bytes by dropping this
information.  In many cases there will be some way to recover it (from
SMTP headers or system logs), but that complicates things.

In a similar vein, keeping the sequence number around would simplify
alert ordering and loss detection on the receiver side.  Especially with
SNMP, where the transport is unreliable as well.

>> (BTW I'd prefer to run the alert scripts as a different user than the
>> various Pacemaker components, but that would lead too far now.)
>
> well, something we thought about already and a point where the
> new feature breaks the ClusterMon-Interface.
> Unfortunately the impact is quite high - crmd has dropped privileges -
> but if the pain-level rises high enough ...

There's very little room to do this.  You'd need to configure an alert
user and group, and store them in the saved uid/gid set before dropping
privileges for the crmd process.  Or use a separate daemon for sending
alerts, which feels cleaner.

>> The SNMP agent seems to have a problem with hrSystemDate, which should
>> be an OCTETSTR with strict format, not some plain textual timestamp.
>> But I haven't really looked into this yet.
>
> Actually I had tried it with the snmptrap tool coming with RHEL 7.2,
> and it worked with the string given in the example.
> Did you copy it 1:1? There is a typo in the document where the
> double quotes are doubled. The format is strict, and there are actually
> two formats allowed - one with timezone and one without. The
> format string given should match the latter.

You are right.  The snmptrap tool does the string->binary conversion if
it gets the correct format.  Otherwise, if the length matches, it does a
plain cast to binary, interpreting for example 12:34:56.78 as
12594-58-51,52:58:53.54,.55:56.  Looks like the sample SNMP alert agent
shouldn't let the user choose any timestamp format but
%Y-%m-%d,%H:%M:%S.%1N,%:z; unfortunately there's no way to enforce this
in the current design.  Maybe it would be more appropriate to get the
timestamp from crmd as a high resolution (fractional) epoch all the
time, and do the string conversion in the agents as necessary.  One
could still control the format via instance_attributes where allowed.
Or keep around the current mechanism as well to reduce code duplication
in the agents.  Just some ideas...
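On Linux, GNU date can render exactly that strict format (%1N and %:z are GNU extensions), which is handy for checking what the agent should emit:

```shell
# Produce a timestamp in the strict format the snmptrap tool parses reliably:
# %Y-%m-%d,%H:%M:%S.%1N,%:z  ->  for example 2016-06-16,17:55:00.3,+02:00
hr_system_date() {
    date "+%Y-%m-%d,%H:%M:%S.%1N,%:z"
}
```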
-- 
Regards,
Feri
