Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-23 Thread Andrew Beekhof
On Fri, Jun 24, 2016 at 1:01 AM, Adam Spiers  wrote:
> Andrew Beekhof  wrote:

>> > Well, if you're OK with bending the rules like this then that's good
>> > enough for me to say we should at least try it :)
>>
>> I still say you shouldn't only do it on error.
>
> When else should it be done?

I was thinking whenever a stop() happens.

> IIUC, disabling/enabling the service is independent of the up/down
> state which nova tracks automatically, and which, based on slightly
> more than a skim of the code, is dependent on the state of the RPC
> layer.
>
>> > But how would you avoid repeated consecutive invocations of "nova
>> > service-disable" when the monitor action fails, and ditto for "nova
>> > service-enable" when it succeeds?
>>
>> I don't think you can. Not ideal but I'd not have thought a deal breaker.
>
> Sounds like a massive deal-breaker to me!  With op monitor
> interval="10s" and 100 compute nodes, that would mean 10 pointless
> calls to nova-api every second.  Am I missing something?

I was thinking you would only call it for the "I detected a failure
case" and service-enable would still be on start().
So the number of pointless calls per second would be capped at one
tenth of the number of failed compute nodes.

One would hope that all of them weren't dead.

>
> Also I don't see any benefit to moving the API calls from start/stop
> actions to the monitor action.  If there's a failure, Pacemaker will
> invoke the stop action, so we can do service-disable there.

I agree. Doing it unconditionally at stop() is my preferred option, I
was only trying to provide a path that might be close to the behaviour
you were looking for.

> If the
> start action is invoked and we successfully initiate startup of
> nova-compute, the RA can undo any service-disable it previously did
> (although it should not reverse a service-disable done elsewhere,
> e.g. manually by the cloud operator).

Agree

>
>> > Earlier in this thread I proposed
>> > the idea of a tiny temporary file in /run which tracks the last known
>> > state and optimizes away the consecutive invocations, but IIRC you
>> > were against that.
>>
>> I'm generally not a fan, but sometimes state files are a necessity.
>> Just make sure you think through what a missing file might mean.
>
> Sure.  A missing file would mean the RA's never called service-disable
> before,

And that is why I generally don't like state files.
The default location for state files doesn't persist across reboots.

t1. stop (ie. disable)
t2. reboot
t3. start with no state file
t4. WHY WONT NOVA USE THE NEW COMPUTE NODE STUPID CLUSTERS

> which means that it shouldn't call service-enable on startup.
>
>> Unless we use the state file to store the date at which the last
>> start operation occurred?
>>
>> If we're calling stop() and (current date - start_date) > threshold, then, if
>> you must, be optimistic, skip service-disable and assume we'll get
>> started again soon.
>>
>> Otherwise if we're calling stop() and (current date - start_date) <= threshold,
>> always call service-disable because we're in a restart loop which is
>> not worth optimising for.
>>
>> ( And always call service-enable at start() )
>>
>> No Pacemaker feature or Beekhof approval required :-)
>
> Hmm ...  it's possible I just don't understand this proposal fully,
> but it sounds a bit woolly to me, e.g. how would you decide a suitable
> threshold?

roll a dice?
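
For concreteness, that threshold check would boil down to something like the
following in the RA's stop() path. This is only a sketch under assumed names
($statefile written by start(), $nova_host, and an arbitrary 120-second
threshold), and as discussed below the idea gets dropped anyway:

    now=$(date +%s)
    start_date=$(cat "$statefile" 2>/dev/null || echo 0)  # recorded by start()
    threshold=120  # seconds; an arbitrary guess, which is exactly the objection above
    if [ $(( now - start_date )) -gt "$threshold" ]; then
        :  # not in a restart loop: be optimistic and skip service-disable
    else
        # restarted recently, i.e. a restart loop: not worth optimising for
        nova service-disable "$nova_host" nova-compute
    fi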

> I think I preferred your other suggestion of just skipping the
> optimization, i.e. calling service-disable on the first stop, and
> service-enable on (almost) every start.

good :)


And the use of force-down from your subsequent email sounds excellent
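
For the record, the force-down variant would reduce the RA's nova interaction
to roughly the following (this assumes the service-force-down subcommand that
recent python-novaclient versions provide; $nova_host is a placeholder):

    # stop(): mark the compute service as forced down so the scheduler stops
    # picking this host, without touching any operator-set disable
    nova service-force-down "$nova_host" nova-compute

    # start(): once nova-compute is confirmed healthy again, clear the flag
    nova service-force-down --unset "$nova_host" nova-compute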



Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-23 Thread Andrew Beekhof
On Fri, Jun 24, 2016 at 1:26 AM, Adam Spiers  wrote:
> Adam Spiers  wrote:
>> As per the FIXME, one remaining problem is dealing with this kind of
>> scenario:
>>
>>   - Cloud operator notices SMART warnings on the compute node
>> which is not yet causing hard failures but signifies that the
>> hard disk might die soon.
>>
>>   - Operator manually runs "nova service-disable" with the intention
>> of doing some maintenance soon, i.e. live-migrating instances away
>> and replacing the dying hard disk.
>>
>>   - Before the operator gracefully shuts down nova-compute, an I/O
>> error from the disk causes nova-compute to fail.
>>
>>   - Pacemaker invokes the monitor action which spots the failure.
>>
>>   - Pacemaker invokes the stop action which runs service-disable.
>>
>>   - Pacemaker attempts to restart nova-compute by invoking the start
>> action.  Since the disk failure is currently intermittent, we
>> get (un)lucky and nova-compute starts fine.
>>
>> Then it calls service-enable - BAD!  This is now overriding the
>> cloud operator's manual request for the service to be disabled.
>> If we're really unlucky, nova-scheduler will now start up new VMs
>> on the node, even though the hard disk is dying.
>>
>> However I can't see a way to defend against this :-/
>
> OK, I think I figured this out.  The answer is not to use
> service-disable at all, but to use force_down in the same way we
> already use it during fencing.  This means we don't mess with the
> intentions of the cloud operator which were manually specified via
> service-disable.
>
> I asked on #openstack-nova and got confirmation that this made sense.
> Hooray!  Dare I suggest we are finally coming close to a consensus?

I'm sure we can find more to argue over if we put our minds to it :-)



Re: [ClusterLabs] crm_resource --cleanup and cluster-recheck-interval

2016-06-23 Thread Ken Gaillot
On 06/15/2016 05:44 AM, Vladislav Bogdanov wrote:
> Hi,
> 
> It seems that after recent commit which introduces staggered probes
> running 'crm_resource --cleanup' (without --resource) leads to cluster
> to finish recheck too long after cleanup was done. What I see: cluster
> fires probes for the first batch of resources, receives their status,
> writes it to CIB, and then sleep until cluster-recheck-interval lapses.
> After that next batch is reprobed and so on.
> 
> This looks like a regression to me.
> 
> Should I file a bug for that, or can it be fixed quickly? IMHO this is
> major enough to be fixed in 1.1.15.
> 
> Best,
> Vladislav

Hi,

Please file a bug, and attach any relevant logs. Obviously, we weren't
able to handle this in time for 1.1.15, but if it's related to staggered
probes, that was introduced in 1.1.14.
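
In the meantime, if the batches really are gated on the recheck timer as
described, one possible stopgap (untested here) is to shorten
cluster-recheck-interval temporarily while the cleanup runs:

    pcs property show cluster-recheck-interval       # Pacemaker's default is 15min
    pcs property set cluster-recheck-interval=30s    # temporarily, for the cleanup
    crm_resource --cleanup
    pcs property set cluster-recheck-interval=15min  # restore the previous value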




Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-23 Thread Adam Spiers
Adam Spiers  wrote:
> As per the FIXME, one remaining problem is dealing with this kind of
> scenario:
> 
>   - Cloud operator notices SMART warnings on the compute node
> which is not yet causing hard failures but signifies that the
> hard disk might die soon.
> 
>   - Operator manually runs "nova service-disable" with the intention
> of doing some maintenance soon, i.e. live-migrating instances away
> and replacing the dying hard disk.
> 
>   - Before the operator gracefully shuts down nova-compute, an I/O
> error from the disk causes nova-compute to fail.
> 
>   - Pacemaker invokes the monitor action which spots the failure.
> 
>   - Pacemaker invokes the stop action which runs service-disable.
> 
>   - Pacemaker attempts to restart nova-compute by invoking the start
> action.  Since the disk failure is currently intermittent, we
> get (un)lucky and nova-compute starts fine.
> 
> Then it calls service-enable - BAD!  This is now overriding the
> cloud operator's manual request for the service to be disabled.
> If we're really unlucky, nova-scheduler will now start up new VMs
> on the node, even though the hard disk is dying.
> 
> However I can't see a way to defend against this :-/

OK, I think I figured this out.  The answer is not to use
service-disable at all, but to use force_down in the same way we
already use it during fencing.  This means we don't mess with the
intentions of the cloud operator which were manually specified via
service-disable.

I asked on #openstack-nova and got confirmation that this made sense.
Hooray!  Dare I suggest we are finally coming close to a consensus?



Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-23 Thread Adam Spiers
Andrew Beekhof  wrote:
> On Wed, Jun 15, 2016 at 10:42 PM, Adam Spiers  wrote:
> > Andrew Beekhof  wrote:
> >> On Mon, Jun 13, 2016 at 9:34 PM, Adam Spiers  wrote:
> >> > Andrew Beekhof  wrote:
> >> >> On Wed, Jun 8, 2016 at 6:23 PM, Adam Spiers  wrote:
> >> >> > Andrew Beekhof  wrote:
> >> >> >> On Wed, Jun 8, 2016 at 12:11 AM, Adam Spiers  
> >> >> >> wrote:
> >> >> >> > We would also need to ensure that service-enable is called on start
> >> >> >> > when necessary.  Perhaps we could track the enable/disable state 
> >> >> >> > in a
> >> >> >> > local temporary file, and if the file indicates that we've 
> >> >> >> > previously
> >> >> >> > done service-disable, we know to run service-enable on start.  This
> >> >> >> > would avoid calling service-enable on every single start.
> >> >> >>
> >> >> >> feels like an over-optimization
> >> >> >> in fact, the whole thing feels like that if i'm honest.
> >> >> >
> >> >> > Huh ... You didn't seem to think that when we discussed automating
> >> >> > service-disable at length in Austin.
> >> >>
> >> >> I didn't feel the need to push back because RH uses the systemd agent
> >> >> instead so you're only hanging yourself, but more importantly because
> >> >> the proposed implementation to facilitate it wasn't leading RA writers
> >> >> down a hazardous path :-)
> >> >
> >> > I'm a bit confused by that statement, because the only proposed
> >> > implementation we came up with in Austin was adding this new feature
> >> > to Pacemaker.
> >>
> >> _A_ new feature, not _this_ new feature.
> >> The one we discussed was far less prone to being abused but, as it
> >> turns out, also far less useful for what you were trying to do.
> >
> > Was there really that significant a change since the original idea?
> > IIRC the only thing which really changed was the type, from "number of
> > retries remaining" to a boolean "there are still some retries left".
> 
> The new implementation has nothing to do with retries. Like the new
> name, it is based on "is a start action expected".

Oh yeah, I remember now.

> That's why I got an attack of the heebie-jeebies.

I'm not sure why, but at least now I understand your change of
position :-)

> > I'm not sure why the integer approach would be far less open to abuse,
> > or even why it would have been far less useful.  I'm probably missing
> > something.
> >
> > [snipped]
> >
> >> >> >> why are we trying to optimise the projected performance impact
> >> >> >
> >> >> > It's not really "projected"; we know exactly what the impact is.  And
> >> >> > it's not really a performance impact either.  If nova-compute (or a
> >> >> > dependency) is malfunctioning on a compute node, there will be a
> >> >> > window (bounded by nova.conf's rpc_response_timeout value, IIUC) in
> >> >> > which nova-scheduler could still schedule VMs onto that compute node,
> >> >> > and then of course they'll fail to boot.
> >> >>
> >> >> Right, but that window exists regardless of whether the node is or is
> >> >> not ever coming back.
> >> >
> >> > Sure, but the window's a *lot* bigger if we don't do service-disable.
> >> > Although perhaps your question "why are we trying to optimise the
> >> > projected performance impact" was actually "why are we trying to avoid
> >> > extra calls to service-disable" rather than "why do we want to call
> >> > service-disable" as I initially assumed.  Is that right?
> >>
> >> Exactly.  I assumed it was to limit the noise we'd be generating in doing 
> >> so.
> >
> > Sort of - not just the noise, but the extra delay introduced by
> > calling service-disable, restarting nova-compute, and then calling
> > service-enable again when it succeeds.
> 
> Ok, but restarting nova-compute is not optional and the bits that are
> optional are all but completely asynchronous* - so the overhead should
> be negligible.
> 
> * Like most API calls, they are Ack'd when the request has been
> received, not processed.

Yes, fair points.

> >> >> > The masakari folks have a lot of operational experience in this space,
> >> >> > and they found that this was enough of a problem to justify calling
> >> >> > nova service-disable whenever the failure is detected.
> >> >>
> >> >> If you really want it whenever the failure is detected, call it from
> >> >> the monitor operation that finds it broken.
> >> >
> >> > Hmm, that appears to violate what I assume would be a fundamental
> >> > design principle of Pacemaker: that the "monitor" action never changes
> >> > the system's state (assuming there are no Heisenberg-like side effects
> >> > of monitoring, of course).
> >>
> >> That has traditionally been considered a good idea, and in the vast
> >> majority of cases I still think it is a good idea, but it's also a
> >> guideline that has been broken because there is no other way for the
> >> agent to work *cough* rabbit *cough*.
> >>
> >> In this specific case, I think it could be forgivable because you're
> >> not strictly altering the service but something that sits

Re: [ClusterLabs] Failover problem with dual primary drbd

2016-06-23 Thread Eric Bourguinat
Found it! I looked at /usr/lib/drbd/crm-fence-peer.sh and found my mistake
when I saw that the constraint is placed on the hostname.

The hostnames are ftpprod04 and ftpprod05,
but I used pcmk-1 and pcmk-2 in the Pacemaker config...
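
A quick sanity check for this kind of mismatch is to compare the hostname the
DRBD handler will put into the constraint with the node name Pacemaker itself
uses; the two need to line up for the fencing constraint to hit the right node:

    uname -n     # ftpprod04 / ftpprod05 here
    crm_node -n  # pcmk-1 / pcmk-2 here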


Eric


On 23/06/2016 09:14, Eric Bourguinat wrote:

My drbd config

drbdadm dump all

# /etc/drbd.conf
global {
usage-count yes;
cmd-timeout-medium 600;
cmd-timeout-long 0;
}

common {
}

# resource home on ftpprod04: not ignored, not stacked
# defined at /etc/drbd.d/home.res:1
resource home {
on ftpprod04 {
device   /dev/drbd1 minor 1;
disk /dev/vghome/lvhome;
meta-disk internal;
address  ipv4 192.168.122.101:7789;
}
on ftpprod05 {
device   /dev/drbd1 minor 1;
disk /dev/vghome/lvhome;
meta-disk internal;
address  ipv4 192.168.122.102:7789;
}
net {
protocol   C;
verify-alg   sha1;
allow-two-primaries yes;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
sndbuf-size  512k;
}
disk {
resync-rate  110M;
on-io-error  detach;
fencing  resource-and-stonith;
al-extents   3389;
}
handlers {
split-brain  "/usr/lib/drbd/notify-split-brain.sh 
**";

fence-peer   /usr/lib/drbd/crm-fence-peer.sh;
after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
}
}

Eric

On 23/06/2016 08:47, Eric Bourguinat wrote:

Hello,

centos 7.2.1511 - pacemaker 1.1.13 - corosync 2.3.4 - drbd 8.4.7-1 - 
drbd84-utils 8.9.5
Linux ftpprod04 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 
UTC 2016 x86_64 x86_64 x86_64 GNU/Linux => pcmk-1
Linux ftpprod05 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 
UTC 2016 x86_64 x86_64 x86_64 GNU/Linux => pcmk-2


source : 
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/


My resources are:

pcs resource
 Master/Slave Set: HomeDataClone [HomeData]
 Masters: [ pcmk-1 pcmk-2 ]
 Clone Set: dlm-clone [dlm]
 Started: [ pcmk-1 pcmk-2 ]
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
 ClusterIP:0 (ocf::heartbeat:IPaddr2): Started pcmk-1
 ClusterIP:1 (ocf::heartbeat:IPaddr2): Started pcmk-2
 Clone Set: HomeFS-clone [HomeFS]
 Started: [ pcmk-1 pcmk-2 ]
 Clone Set: Ftp-clone [Ftp]
 Started: [ pcmk-1 pcmk-2 ]
 Clone Set: Sftp-clone [Sftp]
 Started: [ pcmk-1 pcmk-2 ]

I've a problem when testing failover.

"pkill -9 corosync" on pcmk-2
- stonith reboot pcmk-2 from pcmk-1
- a constraint is set
Jun 22 10:34:36 [1802] ftpprod04 cib: info: cib_perform_op: ++ /cib/configuration/constraints: <rsc_location rsc="HomeDataClone" id="drbd-fence-by-handler-home-HomeDataClone">
Jun 22 10:34:36 [1802] ftpprod04 cib: info: cib_perform_op: ++   <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-home-rule-HomeDataClone">
Jun 22 10:34:36 [1802] ftpprod04 cib: info: cib_perform_op: ++     <expression value="ftpprod04" id="drbd-fence-by-handler-home-expr-HomeDataClone"/>
Jun 22 10:34:36 [1802] ftpprod04 cib: info: cib_perform_op: ++   </rule>
Jun 22 10:34:36 [1802] ftpprod04 cib: info: cib_perform_op: ++ </rsc_location>
Jun 22 10:34:36 [1802] ftpprod04 cib: info: cib_process_request: Completed cib_create operation for section constraints: OK (rc=0, origin=pcmk-1/cibadmin/2, version=0.584.0)

- but
Jun 22 10:34:36 [1806] ftpprod04 pengine: notice: LogActions: Demote HomeData:0 (Master -> Slave pcmk-1)

Why does pengine demote my surviving node?
- the result is that all services of the cluster are stopped
Stack: corosync
Current DC: pcmk-1 (version 1.1.13-10.el7_2.2-44eb2dd) - partition 
with quorum

2 nodes and 14 resources configured

Online: [ pcmk-1 ]
OFFLINE: [ pcmk-2 ]

Full list of resources:

 Master/Slave Set: HomeDataClone [HomeData]
 Stopped: [ pcmk-1 pcmk-2 ]
 Clone Set: dlm-clone [dlm]
 Stopped: [ pcmk-1 pcmk-2 ]
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
 ClusterIP:0 (ocf::heartbeat:IPaddr2): Stopped
 ClusterIP:1 (ocf::heartbeat:IPaddr2): Stopped
 Clone Set: HomeFS-clone [HomeFS]
 Stopped: [ pcmk-1 pcmk-2 ]
 Clone Set: Ftp-clone [Ftp]
 Stopped: [ pcmk-1 pcmk-2 ]
 Clone Set: Sftp-clone [Sftp]
 Stopped: [ pcmk-1 pcmk-2 ]
 fence-pcmk-1 (stonith:fence_ovh): Stopped
 fence-pcmk-2 (stonith:fence_ovh): Stopped

PCSD Status:
  pcmk-1: Online
  pcmk-2: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
- if I start the cluster on pcmk-2, DRBD resyncs, the two nodes become
primary, the constraint is removed, and all the services are started


My constraints:
pcs constraint
Location Constraints:
  Resource: fence-pcmk-1
Enabled on: pcmk-2 (score:INFINITY)

Re: [ClusterLabs] design question to DRBD

2016-06-23 Thread Lentes, Bernd


- On Jun 22, 2016, at 11:48 PM, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:

> On 06/22/2016 04:29 PM, Klaus Wenninger wrote:
>> On 06/22/2016 11:17 PM, Lentes, Bernd wrote:
> 
>>> I'm thinking about active/active. But i think active/passive with a
>>> non-cluster fs is less complicated.
>> But you will need something to control DRBD - especially in the
>> active/passive-case.
>> And the services/IPs would probably have to be pulled to the active side.
> 
> It looks like with modern linux kernels you don't have to
> re-bind()/listen() anymore when an IP address is added. So you can start
> services bound to '*' from init and have pacemaker only manage the
> shared ip address.
> 
> But yes, with active/passive DRBD you need something to control DRBD and
> mount DRBD FS and then start services that depend on DRBD FS.
> 
> Active-active should let you have your filesystem mounted on both nodes
> at once and have things running from init. I never tried it myself so I
> don't know which of them would be "less complicated".
> 
> --

What I mean by "less complicated" is that I prefer to have everything
managed by pacemaker, rather than some things managed by pacemaker and some by init.
That is easier to keep an overview of.
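
For comparison, the "init plus floating IP only" setup Dimitri describes would
leave Pacemaker with just a single primitive; something like the following
(address and netmask are placeholders):

    pcs resource create SharedIP ocf:heartbeat:IPaddr2 \
        ip=192.168.122.200 cidr_netmask=24 op monitor interval=30s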


Bernd
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Dr. Alfons Enhsen, Renate Schlusen 
(komm.)
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671




[ClusterLabs] Antw: Re: Few questions regarding corosync authkey

2016-06-23 Thread Ulrich Windl
>>> Jan Friesse  wrote on 06.06.2016 at 09:01 in message
<57551fc5.9000...@redhat.com>:
>>  Hi,
>>
>> Would like to understand how secure is the corosync authkey.
>> As the authkey is a binary file, how is the private key saved inside the
>> authkey?
> 
> Corosync uses symmetric encryption, so there is no public certificate. 
> authkey = private key
> 
>> What safeguard mechanisms are in place if the private key is compromised?

I don't know the details, but I'm assuming the key is stored as a simple binary
stream in the file.
Easy safeguards against random (not intentional) corruption would be:
1) Add the key length at the start
2) Store the key twice, like <key><key>, maybe using
the one's complement for the second copy.
3) Alternatively, also provide some checksum at the start or end of the key,
maybe like: <length> <key> [<checksum>]

So the user of the key would at least verify its integrity at start and
complain loudly if it seems corrupted, or re-read and check the key
occasionally. Re-reading the key would be a first step towards allowing the
key to be upgraded.
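
None of that exists today, of course; the closest operational safeguard is to
checksum the key out of band and compare it across nodes (node names below are
placeholders):

    sha256sum /etc/corosync/authkey
    ssh node2 sha256sum /etc/corosync/authkey   # the two digests should match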

I guess a node using a different key will be fenced as not responding very 
soon; right?

(Sorry for the late reply, I was busy doing nothing the last two weeks ;-) Now 
trying to read a few thousand messages...

Regards,
Ulrich

> 
> No safeguard mechanisms. Compromised authkey = problem.
> 
>> For example, I don't think it uses any temporary session key which refreshes
>> periodically.
> 
> Exactly
> 
>> Is it possible to dynamically update the key without causing any outage?
> 
> Nope
> 
> Regards,
>Honza
> 
>>
>> -Thanks
>> Nikhil
>>
>>
>>







Re: [ClusterLabs] Bundler Error while building PCSD

2016-06-23 Thread Tomas Jelinek

On 22.6.2016 at 13:06, Anup Halarnkar wrote:

OS: RHEL 7.2
Arch: PPC64LE
uname -a: Linux x 3.10.0-229.ael7b.ppc64le #1 SMP Fri Jan 30 12:03:50 EST 2015 ppc64le ppc64le ppc64le GNU/Linux

Versions of installed ruby packages:
libruby2_1-2_1.ppc64le 2.1.3-4.1

ruby.ppc64le 2.0.0.598-25.el7_1

ruby-irb.noarch 2.0.0.598-25.el7_1

ruby-libs.ppc64le 2.0.0.598-25.el7_1

rubygem-bigdecimal.ppc64le 1.2.0-25.el7_1

rubygem-bundler.noarch 1.7.8-3.el7

rubygem-io-console.ppc64le 0.4.2-25.el7_1

rubygem-json.ppc64le 1.7.7-25.el7_1

rubygem-net-http-persistent.noarch
rubygem-psych.ppc64le 2.0.0-25.el7_1

rubygem-rdoc.noarch 4.0.0-25.el7_1

rubygem-thor.noarch 0.19.1-1.el7

rubygems.noarch 2.0.14-25.el7_1

After a successful build of PCS, I followed the instructions in the README to
build PCSD, but I got the error below.

$ make get_gems
bundle package
Fetching gem metadata from https://rubygems.org/
Fetching version metadata from https://rubygems.org/
Resolving dependencies...
Rubygems 2.0.14 is not threadsafe, so your gems will be installed one at
a time. Upgrade to Rubygems 2.1.0 or higher to enable parallel gem
installation.
Using backports 3.6.8
Installing eventmachine 1.2.0.1 with native extensions

Gem::Installer::ExtensionBuildError: ERROR: Failed to build gem native
extension.

/usr/bin/ruby extconf.rb

mkmf.rb can't find header files for ruby at /usr/share/include/ruby.h

Gem files will remain installed in
/home/anuph/.gem/ruby/gems/eventmachine-1.2.0.1 for inspection.
Results logged to
/home/anuph/.gem/ruby/gems/eventmachine-1.2.0.1/ext/gem_make.out
Installing json 1.8.3 with native extensions

Gem::Installer::ExtensionBuildError: ERROR: Failed to build gem native
extension.

/usr/bin/ruby extconf.rb

mkmf.rb can't find header files for ruby at /usr/share/include/ruby.h

Gem files will remain installed in /home/anuph/.gem/ruby/gems/json-1.8.3
for inspection.
Results logged to
/home/anuph/.gem/ruby/gems/json-1.8.3/ext/json/ext/generator/gem_make.out
Using multi_json 1.12.0
Using open4 1.3.4
Using orderedhash 0.0.6
Using rack 1.6.4
Installing rpam-ruby19 1.2.1 with native extensions

Gem::Installer::ExtensionBuildError: ERROR: Failed to build gem native
extension.

/usr/bin/ruby extconf.rb

mkmf.rb can't find header files for ruby at /usr/share/include/ruby.h

Gem files will remain installed in
/home/anuph/.gem/ruby/gems/rpam-ruby19-1.2.1 for inspection.
Results logged to
/home/anuph/.gem/ruby/gems/rpam-ruby19-1.2.1/ext/Rpam/gem_make.out
Using tilt 2.0.3
Using bundler 1.12.5
An error occurred while installing eventmachine (1.2.0.1), and Bundler
cannot continue.
Make sure that gem install eventmachine -v '1.2.0.1' succeeds before
bundling.
make: *** [get_gems] Error 5

I feel this error could be related to a mismatch in ruby/gem versions, but I'm
not sure.
I would appreciate any input on the above.
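
For what it's worth, the repeated "mkmf.rb can't find header files for ruby"
lines usually point at missing Ruby development headers rather than a gem
version mismatch; on RHEL that would mean something along the lines of:

    yum install ruby-devel gcc make
    make get_gems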


The same question has been asked at 
https://github.com/ClusterLabs/pcs/issues/99 and 
develop...@clusterlabs.org , let's keep the discussion there.


Thanks,
Tomas



Thanks and Regards,
Anup Halarnkar







Re: [ClusterLabs] Failover problem with dual primary drbd

2016-06-23 Thread Eric Bourguinat

My drbd config

drbdadm dump all

# /etc/drbd.conf
global {
usage-count yes;
cmd-timeout-medium 600;
cmd-timeout-long 0;
}

common {
}

# resource home on ftpprod04: not ignored, not stacked
# defined at /etc/drbd.d/home.res:1
resource home {
on ftpprod04 {
device   /dev/drbd1 minor 1;
disk /dev/vghome/lvhome;
meta-disk internal;
address  ipv4 192.168.122.101:7789;
}
on ftpprod05 {
device   /dev/drbd1 minor 1;
disk /dev/vghome/lvhome;
meta-disk internal;
address  ipv4 192.168.122.102:7789;
}
net {
protocol   C;
verify-alg   sha1;
allow-two-primaries yes;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
sndbuf-size  512k;
}
disk {
resync-rate  110M;
on-io-error  detach;
fencing  resource-and-stonith;
al-extents   3389;
}
handlers {
split-brain  "/usr/lib/drbd/notify-split-brain.sh 
**";

fence-peer   /usr/lib/drbd/crm-fence-peer.sh;
after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
}
}

Eric

On 23/06/2016 08:47, Eric Bourguinat wrote:

Hello,

centos 7.2.1511 - pacemaker 1.1.13 - corosync 2.3.4 - drbd 8.4.7-1 - 
drbd84-utils 8.9.5
Linux ftpprod04 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 
UTC 2016 x86_64 x86_64 x86_64 GNU/Linux => pcmk-1
Linux ftpprod05 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 
UTC 2016 x86_64 x86_64 x86_64 GNU/Linux => pcmk-2


source : 
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/


My resources are:

pcs resource
 Master/Slave Set: HomeDataClone [HomeData]
 Masters: [ pcmk-1 pcmk-2 ]
 Clone Set: dlm-clone [dlm]
 Started: [ pcmk-1 pcmk-2 ]
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
 ClusterIP:0 (ocf::heartbeat:IPaddr2): Started pcmk-1
 ClusterIP:1 (ocf::heartbeat:IPaddr2): Started pcmk-2
 Clone Set: HomeFS-clone [HomeFS]
 Started: [ pcmk-1 pcmk-2 ]
 Clone Set: Ftp-clone [Ftp]
 Started: [ pcmk-1 pcmk-2 ]
 Clone Set: Sftp-clone [Sftp]
 Started: [ pcmk-1 pcmk-2 ]

I've a problem when testing failover.

"pkill -9 corosync" on pcmk-2
- stonith reboot pcmk-2 from pcmk-1
- a constraint is set
Jun 22 10:34:36 [1802] ftpprod04 cib: info: cib_perform_op: ++ /cib/configuration/constraints: <rsc_location rsc="HomeDataClone" id="drbd-fence-by-handler-home-HomeDataClone">
Jun 22 10:34:36 [1802] ftpprod04 cib: info: cib_perform_op: ++   <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-home-rule-HomeDataClone">
Jun 22 10:34:36 [1802] ftpprod04 cib: info: cib_perform_op: ++     <expression value="ftpprod04" id="drbd-fence-by-handler-home-expr-HomeDataClone"/>
Jun 22 10:34:36 [1802] ftpprod04 cib: info: cib_perform_op: ++   </rule>
Jun 22 10:34:36 [1802] ftpprod04 cib: info: cib_perform_op: ++ </rsc_location>
Jun 22 10:34:36 [1802] ftpprod04 cib: info: cib_process_request: Completed cib_create operation for section constraints: OK (rc=0, origin=pcmk-1/cibadmin/2, version=0.584.0)

- but
Jun 22 10:34:36 [1806] ftpprod04 pengine: notice: LogActions: Demote HomeData:0 (Master -> Slave pcmk-1)

Why does pengine demote my surviving node?
- the result is that all services of the cluster are stopped
Stack: corosync
Current DC: pcmk-1 (version 1.1.13-10.el7_2.2-44eb2dd) - partition 
with quorum

2 nodes and 14 resources configured

Online: [ pcmk-1 ]
OFFLINE: [ pcmk-2 ]

Full list of resources:

 Master/Slave Set: HomeDataClone [HomeData]
 Stopped: [ pcmk-1 pcmk-2 ]
 Clone Set: dlm-clone [dlm]
 Stopped: [ pcmk-1 pcmk-2 ]
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
 ClusterIP:0 (ocf::heartbeat:IPaddr2): Stopped
 ClusterIP:1 (ocf::heartbeat:IPaddr2): Stopped
 Clone Set: HomeFS-clone [HomeFS]
 Stopped: [ pcmk-1 pcmk-2 ]
 Clone Set: Ftp-clone [Ftp]
 Stopped: [ pcmk-1 pcmk-2 ]
 Clone Set: Sftp-clone [Sftp]
 Stopped: [ pcmk-1 pcmk-2 ]
 fence-pcmk-1 (stonith:fence_ovh): Stopped
 fence-pcmk-2 (stonith:fence_ovh): Stopped

PCSD Status:
  pcmk-1: Online
  pcmk-2: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
- if I start the cluster on pcmk-2, DRBD resyncs, the two nodes become
primary, the constraint is removed, and all the services are started


My constraints:
pcs constraint
Location Constraints:
  Resource: fence-pcmk-1
Enabled on: pcmk-2 (score:INFINITY)
  Resource: fence-pcmk-2
Enabled on: pcmk-1 (score:INFINITY)
Ordering Constraints:
  start ClusterIP-clone then start Ftp-clone (kind:Mandatory)
  start ClusterIP-clone then start Sftp-clone (kind:Mandatory)
  promote HomeDataClone then start HomeFS-clone (kind: