Re: [DRBD-user] secundary not finish synchronizing [actually: automatic data loss by dual-primary, no-fencing, no cluster manager, and automatic after-split-brain recovery policy]

2018-02-13 Thread Lars Ellenberg
On Mon, Feb 12, 2018 at 03:59:26PM -0600, Ricky Gutierrez wrote:
> 2018-02-09 4:40 GMT-06:00 Lars Ellenberg :
> > On Thu, Feb 08, 2018 at 02:52:10PM -0600, Ricky Gutierrez wrote:
> >> 2018-02-08 7:28 GMT-06:00 Lars Ellenberg :
> >> > And your config is?
> >>
> >> resource zimbradrbd {
> >
> >> allow-two-primaries;
> >
> > Why dual primary?
> > I doubt you really need that.
> 
> I do not need it; Zimbra does not support active-active

Don't add complexity you don't need.
Don't allow dual-primary if you *MUST* use it exclusively from one node anyways.

> >> after-sb-1pri discard-secondary;
> >
> > Here you tell it that,
> > if during a handshake after a cluster split brain
> > DRBD notices data divergence,
> > you want it to automatically resolve the situation
> > and discard all changes of the node that is Secondary
> > during that handshake, and overwrite it with the data
> > of the node that is Primary during that handshake.
> >
> >> become-primary-on both;
> >
> > And not even a cluster manager :-(
> 
> Here I forgot to mention that for this function I am using Pacemaker
> and Corosync.

Then don't tell the *init script* to promote DRBD.
If you are using a cluster manager,
controlling DRBD is the job of that cluster manager.
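
A minimal sketch of what that could look like with pcs on CentOS 7
(resource names below are placeholders, adjust to your setup), instead
of "become-primary-on both" in the startup section:

  # make sure a drbd init/systemd service is not started at boot;
  # Pacemaker starts, promotes and demotes DRBD
  systemctl disable drbd

  # DRBD resource managed by Pacemaker, promoted on one node only
  pcs resource create drbd_zimbra ocf:linbit:drbd \
      drbd_resource=zimbradrbd op monitor interval=60s
  pcs resource master ms_drbd_zimbra drbd_zimbra \
      master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true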

> >> > And your logs say?
> >> Feb  5 13:45:29 node-01 kernel:
> >> drbd zimbradrbd: conn( Disconnecting -> StandAlone )
> >
> > That's the uninteresting part (the disconnect).
> > The interesting part is the connection handshake.
> >
> >> > As is, I can only take an (educated) wild guess:
> >> >
> >> > Do you have no (or improperly configured) fencing,
> >>
> >> I don't have.
> >
> > Too bad :-(
> 
> Is there some option to do it in software, rather than by hardware?

There are DRBD "fencing policies" and handlers.
There is Pacemaker "node level fencing" (aka stonith).

To be able to avoid DRBD data divergence due to a cluster split brain,
you'd need both.  Stonith alone is not good enough, and DRBD fencing
policies alone are not good enough.  You need both.

If you absolutely refuse to use stonith, using at least DRBD-level
fencing policies, combined with redundant cluster communications, is
better than using no fencing at all; but without stonith, there will
still be failure scenarios (certain cluster split-brain scenarios) that
will result in data divergence on DRBD.
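
On the DRBD side that looks roughly like this (DRBD 8.4 syntax; the
handler scripts ship with drbd-utils, paths may differ on your
distribution):

  disk {
    # resource-and-stonith if you have working stonith,
    # resource-only if you insist on going without it
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }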

Data divergence is not necessarily better than data corruption:
you end up with two versions of the data that you cannot merge,
at least not in general.
With data corruption (which is the result of a cluster split brain on a
shared disk, without fencing), you at least go straight to your backup;
with a replicated disk and data divergence, you may first waste some
time trying to merge the data sets, before going for the backups
anyways :-/

Without "auto-recovery" strategies configured, you at least
get to decide yourself which version to throw away.
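
For DRBD 8.4 that manual decision boils down to something like this
(a sketch only; double-check the split-brain recovery steps in the
User's Guide for your version):

  # on the node whose changes you decide to throw away:
  drbdadm disconnect zimbradrbd
  drbdadm secondary zimbradrbd
  drbdadm connect --discard-my-data zimbradrbd

  # on the surviving node, if it also dropped to StandAlone:
  drbdadm connect zimbradrbd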

DRBD allows you to "go without fencing", but that's just because there
are people who value "being online with some data" (which is potentially
outdated, but at least consistent), over "rather offline when in doubt".

That does not make it a good idea in general, though.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] secundary not finish synchronizing [actually: automatic data loss by dual-primary, no-fencing, no cluster manager, and automatic after-split-brain recovery policy]

2018-02-13 Thread Ricky Gutierrez
2018-02-09 4:40 GMT-06:00 Lars Ellenberg :
> On Thu, Feb 08, 2018 at 02:52:10PM -0600, Ricky Gutierrez wrote:
>> 2018-02-08 7:28 GMT-06:00 Lars Ellenberg :
>> > And your config is?
>>
>> resource zimbradrbd {
>
>> allow-two-primaries;
>
> Why dual primary?
> I doubt you really need that.

I do not need it; Zimbra does not support active-active

>
>> after-sb-1pri discard-secondary;
>
> Here you tell it that,
> if during a handshake after a cluster split brain
> DRBD notices data divergence,
> you want it to automatically resolve the situation
> and discard all changes of the node that is Secondary
> during that handshake, and overwrite it with the data
> of the node that is Primary during that handshake.
>
>> become-primary-on both;
>
> And not even a cluster manager :-(

Here I forgot to mention that for this function I am using Pacemaker
and Corosync.

>
>> > And your logs say?
>> Feb  5 13:45:29 node-01 kernel:
>> drbd zimbradrbd: conn( Disconnecting -> StandAlone )
>
> That's the uninteresting part (the disconnect).
> The interesting part is the connection handshake.
>
>> > As is, I can only take an (educated) wild guess:
>> >
>> > Do you have no (or improperly configured) fencing,
>>
>> I don't have.
>
> Too bad :-(

Is there some option to do it in software, rather than by hardware?

>
>> > and funky auto-resolve policies configured,
>> > after-split-brain ... discard-secondary maybe?
>>
>> > Configuring automatic after-split-brain recovery policies
>> > is configuring automatic data loss.
>> >
>> > So if you don't mean that, don't do it.
>>
>> I am not an expert in DRBD, but with the configuration I have, which
>> follows the documentation, this situation should not happen, or am I
>> wrong?
>
> Unconditional dual primary directly by init script,
>  no fencing,
>  no cluster manager,
> and policies to automatically choose a victim after data divergence.
>
> You absolutely asked for that situation to happen.
>
> You told DRBD to
> go primary on startup
> ignoring the peer (no cluster manager, no fencing policies)
> and in case it detects data divergence, automatically throw away
> changes on whatever node may be Secondary at that point in time.
> Which is likely the node most recently rebooted.
>
> And that is what happened:
> network disconnected,
> you had both nodes Primary, both could (and likely did)
> keep changing their version of the data.
>
> Some time later, you rebooted (or deconfigured and reconfigured)
> DRBD on at least one node, and during the next DRBD handshake,
> DRBD noticed that both datasets have been changed independently,
> would have refused to connect by default, but you told it to
> automatically resolve that data conflict and discard changes on
> whatever node was Secondary during that handshake.
>
> Apparently the node with the "more interesting" (to you)
> data has been Secondary during that handshake,
> while the other with the "less interesting" (to you) data
> already? still? has been Primary.
>
> And since you have configured DRBD for automatic data loss,
> you got automatic data loss.
>
> Note: those *example* settings for the auto-resolve policies
> may be ok *IF* you have proper fencing enabled.
>
> With fencing, there won't be data divergence, really,
> but DRBD would freeze IO when it first disconnects,
> the (or both) Primaries (at that time) will call the fence handler,
> the fence handler is expected to hard-reset the peer,
> the "winner" of that shoot-out will resume IO,
> the victim may or may not boot back up,
> NOT promote before it can see the peer,
> and during that next handshake obviously be Secondary.
>
> On the off chance that DRBD still thinks there had been data divergence,
> you may want to help it by automatically throwing away the (hopefully
> useless) changes on the then Secondary in this scenario.
> Because if the fencing did work, the shot node should just be
> "linearly older" than the survivor, and DRBD should do a normal resync.
> Which is why I would probably NOT want to add those settings,
> but in case data divergence is detected, rather
> investigate what went wrong, and make a conscious decision.
>
> But it is an option DRBD offers to those that want to enable it,
> and may make sense in some setups.  Mostly for those that value
> "online" over "most recent data", and can not afford
> "administrative intervention" for whatever reason.
>
> Maybe we should add a big red stay-out-here-be-dragons sign
> to both "allow-two-primaries" and the after-split-brain
> auto-recovery policy settings in the guide.

For the moment I have left my configuration the following way; I think
it is the best scenario, even if I have to intervene manually.

resource zimbradrbd {
    protocol C;
    # DRBD device
    device /dev/drbd0;
    # block device
    disk /dev/hamail/opt;
    meta-disk internal;

    syncer {
        rate 40M;
    }

    on node-01.domain.nl {
        # IP address:port
        address 192.168.20.91:7788;
    }
    on node-02.dom

Re: [DRBD-user] secundary not finish synchronizing [actually: automatic data loss by dual-primary, no-fencing, no cluster manager, and automatic after-split-brain recovery policy]

2018-02-09 Thread Lars Ellenberg
On Thu, Feb 08, 2018 at 02:52:10PM -0600, Ricky Gutierrez wrote:
> 2018-02-08 7:28 GMT-06:00 Lars Ellenberg :
> > And your config is?
> 
> resource zimbradrbd {

> allow-two-primaries;

Why dual primary?
I doubt you really need that.

> after-sb-1pri discard-secondary;

Here you tell it that,
if during a handshake after a cluster split brain
DRBD notices data divergence,
you want it to automatically resolve the situation
and discard all changes of the node that is Secondary
during that handshake, and overwrite it with the data
of the node that is Primary during that handshake.
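
If you leave those after-sb-* settings out entirely, you get the
defaults, which (for DRBD 8.4, check drbd.conf(5) for your version)
amount to:

  net {
    # the defaults: on detected data divergence DRBD refuses to
    # (re)connect and waits for the admin to resolve the split brain
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
  }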

> become-primary-on both;

And not even a cluster manager :-(

> > And your logs say?
> Feb  5 13:45:29 node-01 kernel:
> drbd zimbradrbd: conn( Disconnecting -> StandAlone )

That's the uninteresting part (the disconnect).
The interesting part is the connection handshake.

> > As is, I can only take an (educated) wild guess:
> >
> > Do you have no (or improperly configured) fencing,
> 
> I don't have.

Too bad :-(

> > and funky auto-resolve policies configured,
> > after-split-brain ... discard-secondary maybe?
>
> > Configuring automatic after-split-brain recovery policies
> > is configuring automatic data loss.
> >
> > So if you don't mean that, don't do it.
> 
> I am not an expert in DRBD, but with the configuration I have, which
> follows the documentation, this situation should not happen, or am I
> wrong?

Unconditional dual primary directly by init script,
 no fencing,
 no cluster manager,
and policies to automatically choose a victim after data divergence.

You absolutely asked for that situation to happen.

You told DRBD to
go primary on startup
ignoring the peer (no cluster manager, no fencing policies)
and in case it detects data divergence, automatically throw away
changes on whatever node may be Secondary at that point in time.
Which is likely the node most recently rebooted.

And that is what happened:
network disconnected,
you had both nodes Primary, both could (and likely did)
keep changing their version of the data.

Some time later, you rebooted (or deconfigured and reconfigured)
DRBD on at least one node, and during the next DRBD handshake,
DRBD noticed that both datasets have been changed independently,
would have refused to connect by default, but you told it to
automatically resolve that data conflict and discard changes on
whatever node was Secondary during that handshake.

Apparently the node with the "more interesting" (to you)
data has been Secondary during that handshake,
while the other with the "less interesting" (to you) data
already? still? has been Primary.

And since you have configured DRBD for automatic data loss,
you got automatic data loss.

Note: those *example* settings for the auto-resolve policies
may be ok *IF* you have proper fencing enabled.

With fencing, there won't be data divergence, really,
but DRBD would freeze IO when it first disconnects,
the (or both) Primaries (at that time) will call the fence handler,
the fence handler is expected to hard-reset the peer,
the "winner" of that shoot-out will resume IO,
the victim may or may not boot back up,
NOT promote before it can see the peer,
and during that next handshake obviously be Secondary.
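
That hard reset is what Pacemaker node-level fencing (stonith)
provides; a minimal sketch with an IPMI fence device (address and
credentials below are made-up placeholders):

  pcs stonith create fence_node01 fence_ipmilan \
      pcmk_host_list="node-01.domain.nl" ipaddr=192.0.2.1 \
      login=admin passwd=secret lanplus=1 \
      op monitor interval=60s
  pcs property set stonith-enabled=true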

On the off chance that DRBD still thinks there had been data divergence,
you may want to help it by automatically throwing away the (hopefully
useless) changes on the then Secondary in this scenario.
Because if the fencing did work, the shot node should just be
"linearly older" than the survivor, and DRBD should do a normal resync.
Which is why I would probably NOT want to add those settings,
but in case data divergence is detected, rather
investigate what went wrong, and make a conscious decision.

But it is an option DRBD offers to those that want to enable it,
and may make sense in some setups.  Mostly for those that value
"online" over "most recent data", and can not afford
"administrative intervention" for whatever reason.

Maybe we should add a big red stay-out-here-be-dragons sign
to both "allow-two-primaries" and the after-split-brain
auto-recovery policy settings in the guide.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed


Re: [DRBD-user] secundary not finish synchronizing

2018-02-09 Thread Ricky Gutierrez
2018-02-08 7:28 GMT-06:00 Lars Ellenberg :

Hi Lars

> And your config is?

resource zimbradrbd {
    protocol C;
    # DRBD device
    device /dev/drbd0;
    # block device
    disk /dev/hamail/opt;
    meta-disk internal;

    syncer {
        rate 40M;
    }

    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }

    startup {
        become-primary-on both;
    }

    on node-01.domain.nl {
        # IP address:port
        address 192.168.20.91:7788;
    }
    on node-02.domain.nl {
        address 192.168.20.92:7788;
    }
}


> And your logs say?

Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: conn( WFConnection ->
Disconnecting )
Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: Discarding network
configuration.
Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: Connection closed
Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: conn( Disconnecting
-> StandAlone )
Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: receiver terminated
Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: Terminating drbd_r_zimbradr
Feb  5 13:45:29 node-01 kernel: block drbd0: disk( UpToDate -> Failed )
Feb  5 13:45:29 node-01 kernel: block drbd0: 4622 MB (1183159 bits)
marked out-of-sync by on disk bit-map.
Feb  5 13:45:29 node-01 kernel: block drbd0: disk( Failed -> Diskless )
Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: Terminating drbd_w_zimbradr



Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: peer( Secondary ->
Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate ->
DUnknown )
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: ack_receiver terminated
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: Terminating drbd_a_zimbradr
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: Connection closed
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: conn( Disconnecting
-> StandAlone )
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: receiver terminated
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: Terminating drbd_r_zimbradr
Feb  5 14:53:52 node-01 kernel: block drbd0: disk( UpToDate -> Failed )
Feb  5 14:53:52 node-01 kernel: block drbd0: 0 KB (0 bits) marked
out-of-sync by on disk bit-map.
Feb  5 14:53:52 node-01 kernel: block drbd0: disk( Failed -> Diskless )
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: Terminating drbd_w_zimbradr



>
> As is, I can only take an (educated) wild guess:
>
> Do you have no (or improperly configured) fencing,

I don't have.

> and funky auto-resolve policies configured,
> after-split-brain ... discard-secondary maybe?



>
> Configuring automatic after-split-brain recovery policies
> is configuring automatic data loss.
>
> So if you don't mean that, don't do it.
>

I am not an expert in DRBD, but with the configuration I have, which
follows the documentation, this situation should not happen, or am I
wrong?

Thanks, I await your answer.




-- 
rickygm

http://gnuforever.homelinux.com


Re: [DRBD-user] secundary not finish synchronizing

2018-02-08 Thread Lars Ellenberg
On Mon, Feb 05, 2018 at 04:38:21PM -0600, Ricky Gutierrez wrote:
> Hi list, I have a problem. I have two mail servers that replicate a
> secondary disk between them, where I keep the mailboxes; they run as
> primary and secondary on CentOS 7. The problem is that there were two
> power cuts: the primary server had finished synchronizing with the
> secondary, and then there was another power cut and the secondary shut
> down too. After this the power was restored and both powered on.
> 
> The problem is that I have lost data from yesterday and today.

And your config is?
And your logs say?

As is, I can only take an (educated) wild guess:

Do you have no (or improperly configured) fencing,
and funky auto-resolve policies configured,
after-split-brain ... discard-secondary maybe?

Configuring automatic after-split-brain recovery policies
is configuring automatic data loss.

So if you don't mean that, don't do it.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed