Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

2017-11-27 Thread Ken Gaillot
On Mon, 2017-11-13 at 10:24 -0500, Derek Wuelfrath wrote:
> Hello Ken !
> 
> > Make sure that the systemd service is not enabled. If pacemaker is
> > managing a service, systemd can't also be trying to start and stop
> > it.
> 
> It is not. I made sure of this in the first place :)
> 
> > Beyond that, the question is what log messages are there from
> > around
> > the time of the issue (on both nodes).
> 
> Well, that’s the thing. There are not many log messages telling what
> is actually happening. The ’systemd’ resource is not even trying to
> start (nothing in either log for that resource). Here are the logs
> from my last attempt:
> Scenario:
> - Services were running on ‘pancakeFence2’. DRBD was synced and
> connected
> - I rebooted ‘pancakeFence2’. Services failed to ‘pancakeFence1’
> - After ‘pancakeFence2’ comes back, services are running just fine on
> ‘pancakeFence1’ but DRBD is in Standalone due to split-brain
> 
> Logs for pancakeFence1: https://pastebin.com/dVSGPP78
> Logs for pancakeFence2: https://pastebin.com/at8qPkHE

When you say you rebooted the node, was it a clean reboot or a
simulated failure like power-off or kernel-panic? If it was a simulated
failure, then the behavior makes sense in this case. If a node
disappears for no known reason, DRBD ends up in split-brain. If fencing
were configured, the surviving node would fence the other one to be
sure it's down, but it might still be unable to reconnect to DRBD
without manual intervention.
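
For reference, manual recovery from a DRBD 8.x split-brain usually looks
roughly like this ("r0" is only a placeholder for the actual DRBD resource
name, and the DRBD resource should be unmanaged or the cluster in
maintenance-mode while you do it):

  # on the node whose changes you are willing to throw away
  drbdadm disconnect r0
  drbdadm secondary r0
  drbdadm connect --discard-my-data r0

  # on the surviving node (only needed if it is also StandAlone)
  drbdadm connect r0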

The systemd issue is separate, and I can't think of what would cause
it. If you have PCMK_logfile set in /etc/sysconfig/pacemaker, you will
get more extensive log messages there. One node will be elected DC and
will have more "pengine:" messages than the other, that will show all
the decisions made about what actions to take, and the results of those
actions.
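
Something along these lines, for example (the log path is just a common
choice; the sysconfig file is /etc/default/pacemaker on Debian-based
systems):

  # /etc/sysconfig/pacemaker
  PCMK_logfile=/var/log/pacemaker.log

  # then, on whichever node is the DC, pull out the scheduler's decisions
  grep -E 'pengine|crmd' /var/log/pacemaker.log | less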

> It really looks like the status check mechanism of
> corosync/pacemaker for a systemd resource forces the resource to
> “start” and therefore starts the ones above that resource in the
> group (DRBD in this instance).
> This does not happen for a regular OCF resource (IPaddr2, for example)
> 
> Cheers!
> -dw
> 
> --
> Derek Wuelfrath
> dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153
> (x110)
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence
> (www.packetfence.org) and Fingerbank (www.fingerbank.org)
> 
> > On Nov 10, 2017, at 11:39, Ken Gaillot  wrote:
> > 
> > On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:
> > > Hello there,
> > > 
> > > First post here but following for a while!
> > 
> > Welcome!
> > 
> > > Here’s my issue,
> > > we have been putting in place and running this type of cluster for a
> > > while and never really encountered this kind of problem.
> > > 
> > > I recently set up a Corosync / Pacemaker / PCS cluster to manage
> > > DRBD
> > > along with different other resources. Part of these resources
> > > are
> > > some systemd resources… this is the part where things are
> > > “breaking”.
> > > 
> > > Having a two-server cluster running only DRBD or DRBD with an
> > > OCF
> > > ipaddr2 resource (Cluster IP in this instance) works just fine. I can
> > > easily move from one node to the other without any issue.
> > > As soon as I add a systemd resource to the resource group, things
> > > are
> > > breaking. Moving from one node to the other using standby mode
> > > works
> > > just fine but as soon as Corosync / Pacemaker restart involves
> > > polling of a systemd resource, it seems like it is trying to
> > > start
> > > the whole resource group and therefore creates a split-brain of
> > > the DRBD resource.
> > 
> > My first two suggestions would be:
> > 
> > Make sure that the systemd service is not enabled. If pacemaker is
> > managing a service, systemd can't also be trying to start and stop
> > it.
> > 
> > Fencing is the only way pacemaker can resolve split-brains and
> > certain
> > other situations, so that will help in the recovery.
> > 
> > Beyond that, the question is what log messages are there from
> > around
> > the time of the issue (on both nodes).
> > 
> > 
> > > It is the best explanation / description of the situation that I
> > > can
> > > give. If it needs any clarification, examples, … I am more than
> > > open
> > > to share them.
> > > 
> > > Any guidance would be appreciated :)
> > > 
> > > Here’s the output of a ‘pcs config’
> > > 
> > > https://pastebin.com/1TUvZ4X9
> > > 
> > > Cheers!
> > > -dw
> > > 
> > > --
> > > Derek Wuelfrath
> > > dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) ::
> > > +1.866.353.6153
> > > (x110)
> > > Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence
> > > (www.packetfence.org) and Fingerbank (www.fingerbank.org)
> > -- 
> > Ken Gaillot 
> > 

Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

2017-11-15 Thread Digimer
I've driven for 22 years and never needed my seatbelt before, yet I
still make sure I use it every time I am in a car. ;)

Why it happened now is perhaps an interesting question, but it is one I
would try to answer after fixing the core problem.

cheers,

digimer

On 2017-11-15 03:37 PM, Derek Wuelfrath wrote:
> And just to make sure, I’m not the kind of person who sticks to the “we
> always did it that way…” ;)
> Just trying to figure out why it suddenly breaks.
> 
> -derek
> 
> --
> Derek Wuelfrath
> dwuelfr...@inverse.ca <mailto:dwuelfr...@inverse.ca> :: +1.514.447.4918
> (x110) :: +1.866.353.6153 (x110)
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu <https://www.sogo.nu>),
> PacketFence (www.packetfence.org <https://www.packetfence.org/>) and
> Fingerbank (www.fingerbank.org <https://www.fingerbank.org>)
> 
>> On Nov 15, 2017, at 15:30, Derek Wuelfrath <dwuelfr...@inverse.ca
>> <mailto:dwuelfr...@inverse.ca>> wrote:
>>
>> I agree. Thing is, we have had this kind of setup widely deployed for
>> a while. Never ran into any issue.
>> Not sure if something changed in the Corosync/Pacemaker code or in the
>> way it deals with systemd resources.
>>
>> As said, without a systemd resource, everything just works as it
>> should… 100% of the time.
>> As soon as a systemd resource comes in, it breaks.
>>
>> -derek
>>
>> --
>> Derek Wuelfrath
>> dwuelfr...@inverse.ca <mailto:dwuelfr...@inverse.ca> ::
>> +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
>> Inverse inc. :: Leaders behind SOGo (www.sogo.nu
>> <https://www.sogo.nu/>), PacketFence (www.packetfence.org
>> <https://www.packetfence.org/>) and Fingerbank (www.fingerbank.org
>> <https://www.fingerbank.org/>)
>>
>>> On Nov 14, 2017, at 23:03, Digimer <li...@alteeve.ca
>>> <mailto:li...@alteeve.ca>> wrote:
>>>
>>> Quorum doesn't prevent split-brains, stonith (fencing) does. 
>>>
>>> https://www.alteeve.com/w/The_2-Node_Myth
>>>
>>> There is no way to use quorum-only to avoid a potential split-brain.
>>> You might be able to make it less likely with enough effort, but
>>> never prevent it.
>>>
>>> digimer
>>>
>>> On 2017-11-14 10:45 PM, Garima wrote:
>>>> Hello All,
>>>>  
>>>> A split-brain situation occurs when there is a drop in quorum and
>>>> status information is no longer exchanged between the two nodes of
>>>> the cluster.
>>>> This can be avoided if quorum communicates between both nodes.
>>>> I have checked the code. In my opinion these files need to be
>>>> updated (quorum.py/stonith.py) to avoid the split-brain situation and
>>>> maintain the Active-Passive configuration.
>>>>  
>>>> Regards,
>>>> Garima
>>>>  
>>>> *From:* Derek Wuelfrath [mailto:dwuelfr...@inverse.ca] 
>>>> *Sent:* 13 November 2017 20:55
>>>> *To:* Cluster Labs - All topics related to open-source clustering
>>>> welcomed <users@clusterlabs.org>
>>>> *Subject:* Re: [ClusterLabs] Pacemaker responsible of DRBD and a
>>>> systemd resource
>>>>  
>>>> Hello Ken !
>>>>  
>>>>
>>>> Make sure that the systemd service is not enabled. If pacemaker is
>>>> managing a service, systemd can't also be trying to start and
>>>> stop it.
>>>>
>>>>  
>>>> It is not. I made sure of this in the first place :)
>>>>  
>>>>
>>>> Beyond that, the question is what log messages are there from around
>>>> the time of the issue (on both nodes).
>>>>
>>>>  
>>>> Well, that’s the thing. There are not many log messages telling what
>>>> is actually happening. The ’systemd’ resource is not even trying to
>>>> start (nothing in either log for that resource). Here are the logs
>>>> from my last attempt:
>>>> Scenario:
>>>> - Services were running on ‘pancakeFence2’. DRBD was synced and
>>>> connected
>>>> - I rebooted ‘pancakeFence2’. Services failed to ‘pancakeFence1’
>>>> - After ‘pancakeFence2’ comes back, services are running just fine
>>>> on ‘pancakeFence1’ but DRBD is in Standalone due to split-brain
>>>>  
>>>> Logs for pancakeFence1: https://pastebin.com/dVSGPP78
>>>> Logs for pancakeFence2: https://pastebin.com/at8qPkHE

Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

2017-11-15 Thread Derek Wuelfrath
And just to make sure, I’m not the kind of person who sticks to the “we always 
did it that way…” ;)
Just trying to figure out why it suddenly breaks.

-derek

--
Derek Wuelfrath
dwuelfr...@inverse.ca <mailto:dwuelfr...@inverse.ca> :: +1.514.447.4918 (x110) 
:: +1.866.353.6153 (x110)
Inverse inc. :: Leaders behind SOGo (www.sogo.nu <https://www.sogo.nu/>), 
PacketFence (www.packetfence.org <https://www.packetfence.org/>) and Fingerbank 
(www.fingerbank.org <https://www.fingerbank.org/>)

> On Nov 15, 2017, at 15:30, Derek Wuelfrath <dwuelfr...@inverse.ca> wrote:
> 
> I agree. Thing is, we have had this kind of setup widely deployed for a 
> while. Never ran into any issue.
> Not sure if something changed in the Corosync/Pacemaker code or in the way 
> it deals with systemd resources.
> 
> As said, without a systemd resource, everything just works as it should… 100% 
> of the time.
> As soon as a systemd resource comes in, it breaks.
> 
> -derek
> 
> --
> Derek Wuelfrath
> dwuelfr...@inverse.ca <mailto:dwuelfr...@inverse.ca> :: +1.514.447.4918 
> (x110) :: +1.866.353.6153 (x110)
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu <https://www.sogo.nu/>), 
> PacketFence (www.packetfence.org <https://www.packetfence.org/>) and 
> Fingerbank (www.fingerbank.org <https://www.fingerbank.org/>)
> 
>> On Nov 14, 2017, at 23:03, Digimer <li...@alteeve.ca 
>> <mailto:li...@alteeve.ca>> wrote:
>> 
>> Quorum doesn't prevent split-brains, stonith (fencing) does. 
>> 
>> https://www.alteeve.com/w/The_2-Node_Myth 
>> <https://www.alteeve.com/w/The_2-Node_Myth>
>> 
>> There is no way to use quorum-only to avoid a potential split-brain. You 
>> might be able to make it less likely with enough effort, but never prevent 
>> it.
>> 
>> digimer
>> 
>> On 2017-11-14 10:45 PM, Garima wrote:
>>> Hello All,
>>>  
>>> A split-brain situation occurs when there is a drop in quorum and status 
>>> information is no longer exchanged between the two nodes of the cluster.
>>> This can be avoided if quorum communicates between both nodes.
>>> I have checked the code. In my opinion these files need to be updated 
>>> (quorum.py/stonith.py) to avoid the split-brain situation and maintain 
>>> the Active-Passive configuration.
>>>  
>>> Regards,
>>> Garima
>>>  
>>> From: Derek Wuelfrath [mailto:dwuelfr...@inverse.ca 
>>> <mailto:dwuelfr...@inverse.ca>] 
>>> Sent: 13 November 2017 20:55
>>> To: Cluster Labs - All topics related to open-source clustering welcomed 
>>> <users@clusterlabs.org> <mailto:users@clusterlabs.org>
>>> Subject: Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd 
>>> resource
>>>  
>>> Hello Ken !
>>>  
>>> Make sure that the systemd service is not enabled. If pacemaker is
>>> managing a service, systemd can't also be trying to start and stop it.
>>>  
>>> It is not. I made sure of this in the first place :)
>>>  
>>> Beyond that, the question is what log messages are there from around
>>> the time of the issue (on both nodes).
>>>  
>>> Well, that’s the thing. There are not many log messages telling what is 
>>> actually happening. The ’systemd’ resource is not even trying to start 
>>> (nothing in either log for that resource). Here are the logs from my last 
>>> attempt:
>>> Scenario:
>>> - Services were running on ‘pancakeFence2’. DRBD was synced and connected
>>> - I rebooted ‘pancakeFence2’. Services failed to ‘pancakeFence1’
>>> - After ‘pancakeFence2’ comes back, services are running just fine on 
>>> ‘pancakeFence1’ but DRBD is in Standalone due to split-brain
>>>  
>>> Logs for pancakeFence1: https://pastebin.com/dVSGPP78 
>>> <https://pastebin.com/dVSGPP78>
>>> Logs for pancakeFence2: https://pastebin.com/at8qPkHE 
>>> <https://pastebin.com/at8qPkHE>
>>>  
>>> It really looks like the status check mechanism of corosync/pacemaker for 
>>> a systemd resource forces the resource to “start” and therefore starts the 
>>> ones above that resource in the group (DRBD in this instance).
>>> This does not happen for a regular OCF resource (IPaddr2, for example)
>>> 
>>> Cheers!
>>> -dw
>>>  
>>> --
>>> Derek Wuelfrath
>>> dwuelfr...@inverse.ca <mailto:dwuelfr...@inverse.ca> :: +1.514.447.4918 
>>> (x110) :: +1.866.353.6153 (x110)

Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

2017-11-15 Thread Derek Wuelfrath
I agree. Thing is, we have had this kind of setup widely deployed for a 
while. Never ran into any issue.
Not sure if something changed in the Corosync/Pacemaker code or in the way it 
deals with systemd resources.

As said, without a systemd resource, everything just works as it should… 100% of 
the time.
As soon as a systemd resource comes in, it breaks.

-derek

--
Derek Wuelfrath
dwuelfr...@inverse.ca <mailto:dwuelfr...@inverse.ca> :: +1.514.447.4918 (x110) 
:: +1.866.353.6153 (x110)
Inverse inc. :: Leaders behind SOGo (www.sogo.nu <https://www.sogo.nu/>), 
PacketFence (www.packetfence.org <https://www.packetfence.org/>) and Fingerbank 
(www.fingerbank.org <https://www.fingerbank.org/>)

> On Nov 14, 2017, at 23:03, Digimer <li...@alteeve.ca> wrote:
> 
> Quorum doesn't prevent split-brains, stonith (fencing) does. 
> 
> https://www.alteeve.com/w/The_2-Node_Myth 
> <https://www.alteeve.com/w/The_2-Node_Myth>
> 
> There is no way to use quorum-only to avoid a potential split-brain. You 
> might be able to make it less likely with enough effort, but never prevent it.
> 
> digimer
> 
> On 2017-11-14 10:45 PM, Garima wrote:
>> Hello All,
>>  
>> A split-brain situation occurs when there is a drop in quorum and status 
>> information is no longer exchanged between the two nodes of the cluster.
>> This can be avoided if quorum communicates between both nodes.
>> I have checked the code. In my opinion these files need to be updated 
>> (quorum.py/stonith.py) to avoid the split-brain situation and maintain 
>> the Active-Passive configuration.
>>  
>> Regards,
>> Garima
>>  
>> From: Derek Wuelfrath [mailto:dwuelfr...@inverse.ca 
>> <mailto:dwuelfr...@inverse.ca>] 
>> Sent: 13 November 2017 20:55
>> To: Cluster Labs - All topics related to open-source clustering welcomed 
>> <users@clusterlabs.org> <mailto:users@clusterlabs.org>
>> Subject: Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd 
>> resource
>>  
>> Hello Ken !
>>  
>> Make sure that the systemd service is not enabled. If pacemaker is
>> managing a service, systemd can't also be trying to start and stop it.
>>  
>> It is not. I made sure of this in the first place :)
>>  
>> Beyond that, the question is what log messages are there from around
>> the time of the issue (on both nodes).
>>  
>> Well, that’s the thing. There are not many log messages telling what is 
>> actually happening. The ’systemd’ resource is not even trying to start 
>> (nothing in either log for that resource). Here are the logs from my last 
>> attempt:
>> Scenario:
>> - Services were running on ‘pancakeFence2’. DRBD was synced and connected
>> - I rebooted ‘pancakeFence2’. Services failed to ‘pancakeFence1’
>> - After ‘pancakeFence2’ comes back, services are running just fine on 
>> ‘pancakeFence1’ but DRBD is in Standalone due to split-brain
>>  
>> Logs for pancakeFence1: https://pastebin.com/dVSGPP78 
>> <https://pastebin.com/dVSGPP78>
>> Logs for pancakeFence2: https://pastebin.com/at8qPkHE 
>> <https://pastebin.com/at8qPkHE>
>>  
>> It really looks like the status check mechanism of corosync/pacemaker for 
>> a systemd resource forces the resource to “start” and therefore starts the 
>> ones above that resource in the group (DRBD in this instance).
>> This does not happen for a regular OCF resource (IPaddr2, for example)
>> 
>> Cheers!
>> -dw
>>  
>> --
>> Derek Wuelfrath
>> dwuelfr...@inverse.ca <mailto:dwuelfr...@inverse.ca> :: +1.514.447.4918 
>> (x110) :: +1.866.353.6153 (x110)
>> Inverse inc. :: Leaders behind SOGo (www.sogo.nu <https://www.sogo.nu/>), 
>> PacketFence (www.packetfence.org <https://www.packetfence.org/>) and 
>> Fingerbank (www.fingerbank.org <https://www.fingerbank.org/>)
>> 
>> 
>> On Nov 10, 2017, at 11:39, Ken Gaillot <kgail...@redhat.com 
>> <mailto:kgail...@redhat.com>> wrote:
>>  
>> On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:
>> 
>> Hello there,
>> 
>> First post here but following for a while!
>> 
>> Welcome!
>> 
>> 
>> 
>> Here’s my issue,
>> we have been putting in place and running this type of cluster for a
>> while and never really encountered this kind of problem.
>> 
>> I recently set up a Corosync / Pacemaker / PCS cluster to manage DRBD
>> along with different other resources. Part of these resources are
>> some systemd resources… this is the part where things are “breaking”.

Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

2017-11-14 Thread Digimer
Quorum doesn't prevent split-brains, stonith (fencing) does.

https://www.alteeve.com/w/The_2-Node_Myth

There is no way to use quorum-only to avoid a potential split-brain. You
might be able to make it less likely with enough effort, but never
prevent it.

digimer

On 2017-11-14 10:45 PM, Garima wrote:
> Hello All,
> 
> A split-brain situation occurs when there is a drop in quorum and status
> information is no longer exchanged between the two nodes of the cluster.
> This can be avoided if quorum communicates between both nodes.
> I have checked the code. In my opinion these files need to be updated
> (quorum.py/stonith.py) to avoid the split-brain situation and maintain
> the Active-Passive configuration.
> 
> Regards,
> Garima
> 
> From: Derek Wuelfrath [mailto:dwuelfr...@inverse.ca]
> Sent: 13 November 2017 20:55
> To: Cluster Labs - All topics related to open-source clustering
> welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd
> resource
> 
> Hello Ken !
> 
> Make sure that the systemd service is not enabled. If pacemaker is
> managing a service, systemd can't also be trying to start and stop it.
> 
> It is not. I made sure of this in the first place :)
> 
> Beyond that, the question is what log messages are there from around
> the time of the issue (on both nodes).
> 
> Well, that’s the thing. There are not many log messages telling what
> is actually happening. The ’systemd’ resource is not even trying to
> start (nothing in either log for that resource). Here are the logs
> from my last attempt:
> Scenario:
> - Services were running on ‘pancakeFence2’. DRBD was synced and connected
> - I rebooted ‘pancakeFence2’. Services failed to ‘pancakeFence1’
> - After ‘pancakeFence2’ comes back, services are running just fine on
> ‘pancakeFence1’ but DRBD is in Standalone due to split-brain
> 
> Logs for pancakeFence1: https://pastebin.com/dVSGPP78
> Logs for pancakeFence2: https://pastebin.com/at8qPkHE
> 
> It really looks like the status check mechanism of corosync/pacemaker
> for a systemd resource forces the resource to “start” and therefore
> starts the ones above that resource in the group (DRBD in this instance).
> This does not happen for a regular OCF resource (IPaddr2, for example)
> 
> Cheers!
> -dw
> 
> --
> Derek Wuelfrath
> dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence
> (www.packetfence.org) and Fingerbank (www.fingerbank.org)

Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

2017-11-14 Thread Garima
Hello All,

A split-brain situation occurs when there is a drop in quorum and status 
information is no longer exchanged between the two nodes of the cluster.
This can be avoided if quorum communicates between both nodes.
I have checked the code. In my opinion these files need to be updated 
(quorum.py/stonith.py) to avoid the split-brain situation and maintain the 
Active-Passive configuration.

Regards,
Garima

From: Derek Wuelfrath [mailto:dwuelfr...@inverse.ca]
Sent: 13 November 2017 20:55
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>
Subject: Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

Hello Ken !

Make sure that the systemd service is not enabled. If pacemaker is
managing a service, systemd can't also be trying to start and stop it.

It is not. I made sure of this in the first place :)

Beyond that, the question is what log messages are there from around
the time of the issue (on both nodes).

Well, that’s the thing. There are not many log messages telling what is actually 
happening. The ’systemd’ resource is not even trying to start (nothing in 
either log for that resource). Here are the logs from my last attempt:
Scenario:
- Services were running on ‘pancakeFence2’. DRBD was synced and connected
- I rebooted ‘pancakeFence2’. Services failed to ‘pancakeFence1’
- After ‘pancakeFence2’ comes back, services are running just fine on 
‘pancakeFence1’ but DRBD is in Standalone due to split-brain

Logs for pancakeFence1: https://pastebin.com/dVSGPP78
Logs for pancakeFence2: https://pastebin.com/at8qPkHE

It really looks like the status check mechanism of corosync/pacemaker for a 
systemd resource forces the resource to “start” and therefore starts the ones 
above that resource in the group (DRBD in this instance).
This does not happen for a regular OCF resource (IPaddr2, for example)

Cheers!
-dw

--
Derek Wuelfrath
dwuelfr...@inverse.ca<mailto:dwuelfr...@inverse.ca> :: +1.514.447.4918 (x110) 
:: +1.866.353.6153 (x110)
Inverse inc. :: Leaders behind SOGo (www.sogo.nu<https://www.sogo.nu/>), 
PacketFence (www.packetfence.org<https://www.packetfence.org/>) and Fingerbank 
(www.fingerbank.org<https://www.fingerbank.org>)


On Nov 10, 2017, at 11:39, Ken Gaillot 
<kgail...@redhat.com<mailto:kgail...@redhat.com>> wrote:

On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:

Hello there,

First post here but following for a while!

Welcome!



Here’s my issue,
we have been putting in place and running this type of cluster for a
while and never really encountered this kind of problem.

I recently set up a Corosync / Pacemaker / PCS cluster to manage DRBD
along with different other resources. Part of these resources are
some systemd resources… this is the part where things are “breaking”.

Having a two-server cluster running only DRBD or DRBD with an OCF
ipaddr2 resource (Cluster IP in this instance) works just fine. I can
easily move from one node to the other without any issue.
As soon as I add a systemd resource to the resource group, things are
breaking. Moving from one node to the other using standby mode works
just fine but as soon as Corosync / Pacemaker restart involves
polling of a systemd resource, it seems like it is trying to start
the whole resource group and therefore creates a split-brain of the
DRBD resource.

My first two suggestions would be:

Make sure that the systemd service is not enabled. If pacemaker is
managing a service, systemd can't also be trying to start and stop it.

Fencing is the only way pacemaker can resolve split-brains and certain
other situations, so that will help in the recovery.

Beyond that, the question is what log messages are there from around
the time of the issue (on both nodes).




It is the best explanation / description of the situation that I can
give. If it needs any clarification, examples, … I am more than open
to share them.

Any guidance would be appreciated :)

Here’s the output of a ‘pcs config’

https://pastebin.com/1TUvZ4X9

Cheers!
-dw

--
Derek Wuelfrath
dwuelfr...@inverse.ca<mailto:dwuelfr...@inverse.ca> :: +1.514.447.4918 (x110) 
:: +1.866.353.6153
(x110)
Inverse inc. :: Leaders behind SOGo (www.sogo.nu<http://www.sogo.nu>), 
PacketFence
(www.packetfence.org<http://www.packetfence.org>) and Fingerbank 
(www.fingerbank.org<http://www.fingerbank.org>)
--
Ken Gaillot <kgail...@redhat.com<mailto:kgail...@redhat.com>>


Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

2017-11-10 Thread Ken Gaillot
On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:
> Hello there,
> 
> First post here but following for a while!

Welcome!

> 
> Here’s my issue,
> we have been putting in place and running this type of cluster for a
> while and never really encountered this kind of problem.
> 
> I recently set up a Corosync / Pacemaker / PCS cluster to manage DRBD
> along with different other resources. Part of these resources are
> some systemd resources… this is the part where things are “breaking”.
> 
> Having a two-server cluster running only DRBD or DRBD with an OCF
> ipaddr2 resource (Cluster IP in this instance) works just fine. I can
> easily move from one node to the other without any issue.
> As soon as I add a systemd resource to the resource group, things are
> breaking. Moving from one node to the other using standby mode works
> just fine but as soon as Corosync / Pacemaker restart involves
> polling of a systemd resource, it seems like it is trying to start
> the whole resource group and therefore creates a split-brain of the
> DRBD resource.

My first two suggestions would be:

Make sure that the systemd service is not enabled. If pacemaker is
managing a service, systemd can't also be trying to start and stop it.
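
A quick way to check, where "myservice" just stands in for whatever unit
the cluster manages:

  systemctl is-enabled myservice.service   # should not report "enabled"
  systemctl disable myservice.service      # let Pacemaker own start/stop
  systemctl status myservice.service       # confirm systemd isn't running it on its own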

Fencing is the only way pacemaker can resolve split-brains and certain
other situations, so that will help in the recovery.
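
As a rough sketch only (fence_ipmilan is just one example agent, the
addresses and credentials are placeholders, and parameter names vary by
agent version; pick the agent that matches your hardware):

  pcs stonith create fence-node1 fence_ipmilan pcmk_host_list=pancakeFence1 \
      ipaddr=192.0.2.1 login=admin passwd=secret op monitor interval=60s
  pcs stonith create fence-node2 fence_ipmilan pcmk_host_list=pancakeFence2 \
      ipaddr=192.0.2.2 login=admin passwd=secret op monitor interval=60s
  pcs property set stonith-enabled=true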

Beyond that, the question is what log messages are there from around
the time of the issue (on both nodes).


> 
> It is the best explanation / description of the situation that I can
> give. If it needs any clarification, examples, … I am more than open
> to share them.
> 
> Any guidance would be appreciated :)
> 
> Here’s the output of a ‘pcs config’
> 
> https://pastebin.com/1TUvZ4X9
> 
> Cheers!
> -dw
> 
> --
> Derek Wuelfrath
> dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153
> (x110)
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence
> (www.packetfence.org) and Fingerbank (www.fingerbank.org)
-- 
Ken Gaillot 



[ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

2017-11-09 Thread Derek Wuelfrath
Hello there,

First post here but following for a while!

Here’s my issue,
we have been putting in place and running this type of cluster for a while and 
never really encountered this kind of problem.

I recently set up a Corosync / Pacemaker / PCS cluster to manage DRBD along 
with different other resources. Part of these resources are some systemd 
resources… this is the part where things are “breaking”.

Having a two-server cluster running only DRBD or DRBD with an OCF ipaddr2 
resource (Cluster IP in this instance) works just fine. I can easily move from one 
node to the other without any issue.
As soon as I add a systemd resource to the resource group, things are breaking. 
Moving from one node to the other using standby mode works just fine but as 
soon as Corosync / Pacemaker restart involves polling of a systemd resource, it 
seems like it is trying to start the whole resource group and therefore creates 
a split-brain of the DRBD resource.
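
For context, the ordering/colocation glue between a DRBD master/slave
resource and such a group typically looks something like the following
(resource names here are only placeholders; the actual configuration is in
the pcs config linked below):

  # DRBD must be promoted on a node before the group (IP, FS, systemd services) starts there
  pcs constraint order promote drbd-ms then start services-group
  pcs constraint colocation add services-group with master drbd-ms INFINITY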

It is the best explanation / description of the situation that I can give. If 
it needs any clarification, examples, … I am more than open to share them.

Any guidance would be appreciated :)

Here’s the output of a ‘pcs config’

https://pastebin.com/1TUvZ4X9 

Cheers!
-dw

--
Derek Wuelfrath
dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) 
:: +1.866.353.6153 (x110)
Inverse inc. :: Leaders behind SOGo (www.sogo.nu), 
PacketFence (www.packetfence.org) and Fingerbank 
(www.fingerbank.org)

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org