Re: [DRBD-user] secundary not finish synchronizing [actually: automatic data loss by dual-primary, no-fencing, no cluster manager, and automatic after-split-brain recovery policy]

2018-02-09 Thread Lars Ellenberg
On Thu, Feb 08, 2018 at 02:52:10PM -0600, Ricky Gutierrez wrote:
> 2018-02-08 7:28 GMT-06:00 Lars Ellenberg :
> > And your config is?
> 
> resource zimbradrbd {

> allow-two-primaries;

Why dual primary?
I doubt you really need that.

> after-sb-1pri discard-secondary;

Here you tell it that,
if during a handshake after a cluster split brain
DRBD notices data divergence,
you want it to automatically resolve the situation
and discard all changes of the node that is Secondary
during that handshake, and overwrite it with the data
of the node that is Primary during that handshake.
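
For reference, the conservative behaviour is to not auto-resolve at all:
the DRBD 8.4 defaults simply refuse to reconnect after data divergence
and wait for the operator. A sketch, spelling out those defaults
explicitly for the resource name used in this thread:

```
resource zimbradrbd {
    net {
        # defaults: on data divergence do nothing automatic,
        # drop the connection and let the admin decide
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
    }
}
```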

> become-primary-on both;

And not even a cluster manager :-(

> > And you logs say?
> Feb  5 13:45:29 node-01 kernel:
> drbd zimbradrbd: conn( Disconnecting -> StandAlone )

That's the uninteresting part (the disconnect).
The interesting part is the connection handshake.

> > As is, I can only take an (educated) wild guess:
> >
> > Do you have no (or improperly configured) fencing,
> 
> I don't have any.

Too bad :-(

> > and funky auto-resolve policies configured,
> > after-split-brain ... discard-secondary maybe?
>
> > Configuring automatic after-split-brain recovery policies
> > is configuring automatic data loss.
> >
> > So if you don't mean that, don't do it.
> 
> I am not an expert in DRBD, but with the configuration that I have,
> according to the documentation, this situation should not happen,
> or am I wrong?

Unconditional dual primary directly by init script,
 no fencing,
 no cluster manager,
and policies to automatically choose a victim after data divergence.

You absolutely asked for that situation to happen.

You told DRBD to
go primary on startup
ignoring the peer (no cluster manager, no fencing policies)
and in case it detects data divergence, automatically throw away
changes on whatever node may be Secondary at that point in time.
Which is likely the node most recently rebooted.

And that is what happened:
network disconnected,
you had both nodes Primary, both could (and likely did)
keep changing their version of the data.

Some time later, you rebooted (or deconfigured and reconfigured)
DRBD on at least one node, and during the next DRBD handshake,
DRBD noticed that both datasets have been changed independently,
would have refused to connect by default, but you told it to
automatically resolve that data conflict and discard changes on
whatever node was Secondary during that handshake.

Apparently the node with the "more interesting" (to you)
data has been Secondary during that handshake,
while the other node with the "less interesting" (to you) data
already? still? has been Primary.

And since you have configured DRBD for automatic data loss,
you got automatic data loss.

Note: those *example* settings for the auto-resolve policies
may be ok *IF* you have proper fencing enabled.

With fencing, there won't be data divergence, really,
but DRBD would freeze IO when it first disconnects,
the (or both) Primaries (at that time) will call the fence handler,
the fence handler is expected to hard-reset the peer,
the "winner" of that shoot-out will resume IO,
the victim may or may not boot back up,
NOT promote before it can see the peer,
and during that next handshake obviously be Secondary.
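
The fencing behaviour sketched above corresponds roughly to a
configuration like this (a sketch for DRBD 8.4 with Pacemaker; the
handler scripts are the ones shipped with drbd-utils, and their paths
may differ on your distribution):

```
resource zimbradrbd {
    disk {
        # freeze IO on disconnect and fence the peer before resuming
        fencing resource-and-stonith;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}
```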

On the off chance that DRBD still thinks there had been data divergence,
you may want to help it by automatically throwing away the (hopefully
useless) changes on the then Secondary in this scenario.
Because if the fencing did work, the shot node should just be
"linearly older" than the survivor, and DRBD should do a normal resync.
Which is why I would probably NOT add those settings,
but in case data divergence is detected, rather
investigate what went wrong, and make a conscious decision.
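
For completeness, the manual resolution after a detected split brain
(as described in the DRBD user's guide) would look roughly like this,
run on the node whose changes you have decided to sacrifice:

```
# on the chosen victim: drop its changes and resync from the peer
drbdadm disconnect zimbradrbd
drbdadm secondary zimbradrbd
drbdadm connect --discard-my-data zimbradrbd

# on the survivor, only if it went StandAlone as well:
drbdadm connect zimbradrbd
```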

But it is an option DRBD offers to those that want to enable it,
and may make sense in some setups.  Mostly for those that value
"online" over "most recent data", and cannot afford
"administrative intervention" for whatever reason.

Maybe we should add a big red stay-out, here-be-dragons sign
to both "allow-two-primaries" and the after-split-brain
auto-recovery policy settings in the guide.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] secundary not finish synchronizing

2018-02-09 Thread Ricky Gutierrez
2018-02-08 7:28 GMT-06:00 Lars Ellenberg:

Hi Lars

> And your config is?

resource zimbradrbd {
    protocol C;
    # DRBD device
    device /dev/drbd0;
    # backing block device
    disk /dev/hamail/opt;
    meta-disk internal;

    syncer {
        rate 40M;
    }

    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }

    startup {
        become-primary-on both;
    }

    on node-01.domain.nl {
        # IP address:port
        address 192.168.20.91:7788;
    }
    on node-02.domain.nl {
        address 192.168.20.92:7788;
    }
}


> And you logs say?

Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: conn( WFConnection ->
Disconnecting )
Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: Discarding network
configuration.
Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: Connection closed
Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: conn( Disconnecting
-> StandAlone )
Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: receiver terminated
Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: Terminating drbd_r_zimbradr
Feb  5 13:45:29 node-01 kernel: block drbd0: disk( UpToDate -> Failed )
Feb  5 13:45:29 node-01 kernel: block drbd0: 4622 MB (1183159 bits)
marked out-of-sync by on disk bit-map.
Feb  5 13:45:29 node-01 kernel: block drbd0: disk( Failed -> Diskless )
Feb  5 13:45:29 node-01 kernel: drbd zimbradrbd: Terminating drbd_w_zimbradr



Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: peer( Secondary ->
Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate ->
DUnknown )
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: ack_receiver terminated
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: Terminating drbd_a_zimbradr
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: Connection closed
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: conn( Disconnecting
-> StandAlone )
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: receiver terminated
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: Terminating drbd_r_zimbradr
Feb  5 14:53:52 node-01 kernel: block drbd0: disk( UpToDate -> Failed )
Feb  5 14:53:52 node-01 kernel: block drbd0: 0 KB (0 bits) marked
out-of-sync by on disk bit-map.
Feb  5 14:53:52 node-01 kernel: block drbd0: disk( Failed -> Diskless )
Feb  5 14:53:52 node-01 kernel: drbd zimbradrbd: Terminating drbd_w_zimbradr



>
> As is, I can only take an (educated) wild guess:
>
> Do you have no (or improperly configured) fencing,

I don't have any.

> and funky auto-resolve policies configured,
> after-split-brain ... discard-secondary maybe?



>
> Configuring automatic after-split-brain recovery policies
> is configuring automatic data loss.
>
> So if you don't mean that, don't do it.
>

I am not an expert in DRBD, but with the configuration that I have,
according to the documentation, this situation should not happen, or
am I wrong?

Thanks, I await your answer.




-- 
rickygm

http://gnuforever.homelinux.com


Re: [DRBD-user] secundary not finish synchronizing

2018-02-08 Thread Lars Ellenberg
On Mon, Feb 05, 2018 at 04:38:21PM -0600, Ricky Gutierrez wrote:
> Hi list, I have a problem. I have two mail servers that replicate a
> secondary disk between them; that disk holds the mailboxes. They run
> as primary and secondary on CentOS 7. The problem is that there were
> two power cuts: after the first, the primary server finished
> synchronizing with the secondary; then there was another power cut
> and the secondary shut down too. After this the power was restored
> and both powered on.
> 
> The problem is that I have lost data from yesterday and today.

And your config is?
And you logs say?

As is, I can only take an (educated) wild guess:

Do you have no (or improperly configured) fencing,
and funky auto-resolve policies configured,
after-split-brain ... discard-secondary maybe?

Configuring automatic after-split-brain recovery policies
is configuring automatic data loss.

So if you don't mean that, don't do it.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker



[DRBD-user] secundary not finish synchronizing

2018-02-05 Thread Ricky Gutierrez
Hi list, I have a problem. I have two mail servers that replicate a
secondary disk between them; that disk holds the mailboxes. They run
as primary and secondary on CentOS 7. The problem is that there were
two power cuts: after the first, the primary server finished
synchronizing with the secondary; then there was another power cut
and the secondary shut down too. After this the power was restored
and both powered on.

The problem is that I have lost data from yesterday and today.

version of drbd

drbd84-utils-sysvinit-9.1.0-1.el7.elrepo.x86_64
kmod-drbd84-8.4.10-1_2.el7_4.elrepo.x86_64
drbd84-utils-9.1.0-1.el7.elrepo.x86_64

version: 8.4.10-1 (api:1/proto:86-101)

GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by
mockbuild@, 2017-09-15 14:23:22
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-
ns:0 nr:4778741 dw:4778741 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1
wo:f oos:0

version: 8.4.10-1 (api:1/proto:86-101)

GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by
mockbuild@, 2017-09-15 14:23:22
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-
ns:4749127 nr:190015 dw:4939142 dr:821601 al:141 bm:0 lo:0 pe:0
ua:0 ap:0 ep:1 wo:f oos:0


Both are synchronized, but I do not have the data from those days.



-- 
rickygm

http://gnuforever.homelinux.com