Henning

You are probably right that this isn't the right place to continue with
non-bug questions, though Andres or others could answer that more
accurately.

I would recommend the drbd-users mailing list, as there are many experts
there for config and troubleshooting.  The Pacemaker mailing list is
also a good one.

*snip*

I don't have any idea why you're getting the broken pipe... but do you
have STONITH/fencing configured?

Normally when a communication link breaks like that, you would want
Pacemaker to STONITH the disconnected node.  Prior to the STONITH,
whichever DRBD node is going to survive should fence the resource on its
peer, preventing the peer from becoming primary until it is UpToDate.

I use the fence-peer handler in DRBD with fencing set to resource-only
and then have STONITH configured in Pacemaker.  This way a break in the
DRBD link doesn't automatically STONITH the node (but it does prevent
the borked DRBD from coming up as Master and causing split brain) unless
the cluster communications are also dead, at which point Pacemaker will
shoot the node.
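
For reference, here is a minimal sketch of the DRBD side of that setup.
It assumes DRBD 8.x (where fencing lives in the disk section) and the
handler scripts shipped with the DRBD packages.  The resource name r0 is
just a placeholder, and the script paths may differ on your
distribution, so check where your package installs them:

    resource r0 {
      disk {
        # on a link break, call the fence-peer handler instead of
        # relying on a node-level shoot
        fencing resource-only;
      }
      handlers {
        # adds a Pacemaker constraint that keeps the peer from being
        # promoted to Master
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # removes that constraint once the resync has finished
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
    }

On the Pacemaker side, STONITH still has to be enabled and a STONITH
device configured for your hardware.  That part is site-specific, so I
have left it out.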

I think the lack of fencing/STONITH is causing the split brain: both
nodes do their own thing while they are not communicating, which makes
the data sets diverge.

> 
> This causes a split brain every time this happens even though there
> are no writes on the devices yet.

Do you still have your split brain handling configured like this?
    after-sb-0pri disconnect;
    after-sb-1pri consensus;
    after-sb-2pri disconnect;

You are telling it to disconnect regardless of changes with these split
brain lines (I believe that's part of the cause).  Have you considered
using more aggressive split brain handling if you're going with dual
primary?  It can be a controversial topic because of the potential for
data loss, but...

The bottom of this page in the DRBD User's Guide lists split brain
behaviors that are considered OK for dual primary/clustered filesystem
setups:

http://www.drbd.org/users-guide/s-configure-split-brain-behavior.html

    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;

You would likely hit the after-sb-1pri case if you had fencing.  It
would go something like this:

1. Break in comms.
2. Node 1 is fenced, which prevents it from becoming master.
3. Pacemaker shoots Node 1.
4. Node 1 reboots and (if set up to auto start the cluster) Pacemaker
   accepts the node back into the cluster.
5. DRBD links up and finds that Node 1 has diverged.
6. Node 1 is fenced, so it is not master right now.
7. DRBD considers the after-sb-1pri rule.  This assumes that, since it's
   dual primary, if you have one primary then the data set on that
   primary is always good and there is no need to preserve the
   secondary's data, so it just overwrites it.
8. That basically executes the commands you ran manually and discards
   the Node 1 data.
9. Once Node 1 is UpToDate, DRBD removes the fencing and allows Node 1
   to become master.


I hope all of that wasn't too confusing!

Jake

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to ocfs2-tools in Ubuntu.
https://bugs.launchpad.net/bugs/799711

Title:
  o2cb[11796]: ERROR: ocfs2_controld.pcmk did not come up
