Re: [ovirt-users] sanlock + gluster recovery -- RFE

2014-06-02 Thread Vijay Bellur

I am sorry, this missed my attention over the last few days.

On 05/23/2014 08:50 PM, Ted Miller wrote:

Vijay, I am not a member of the developer list, so my comments are at end.

On 5/23/2014 6:55 AM, Vijay Bellur wrote:

On 05/21/2014 10:22 PM, Federico Simoncelli wrote:

- Original Message -

From: Giuseppe Ragusa giuseppe.rag...@hotmail.com
To: fsimo...@redhat.com
Cc: users@ovirt.org
Sent: Wednesday, May 21, 2014 5:15:30 PM
Subject: sanlock + gluster recovery -- RFE

Hi,


- Original Message -

From: Ted Miller tmiller at hcjb.org
To: users users at ovirt.org
Sent: Tuesday, May 20, 2014 11:31:42 PM
Subject: [ovirt-users] sanlock + gluster recovery -- RFE

As you are aware, there is an ongoing split-brain problem with
running
sanlock on replicated gluster storage. Personally, I believe that
this is
the 5th time that I have been bitten by this sanlock+gluster problem.

I believe that the following are true (if not, my entire request is probably
off base).


 * ovirt uses sanlock in such a way that when the sanlock storage is on a
   replicated gluster file system, very small storage disruptions can
   result in a gluster split-brain on the sanlock space


Although this is possible (at the moment) we are working hard to
avoid it.
The hardest part here is to ensure that the gluster volume is properly
configured.

The suggested configuration for a volume to be used with ovirt is:

Volume Name: (...)
Type: Replicate
Volume ID: (...)
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
(...three bricks...)
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto

The two options ping-timeout and quorum-type are really important.
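
For reference, the two options can be applied to an existing volume with
gluster volume set and checked with gluster volume info. This is a minimal
sketch; VOLNAME is a placeholder for the real volume name:

    gluster volume set VOLNAME network.ping-timeout 10
    gluster volume set VOLNAME cluster.quorum-type auto
    gluster volume info VOLNAME   # both should appear under "Options Reconfigured"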

You would also need a build where this bug is fixed in order to avoid any
chance of a split-brain:

https://bugzilla.redhat.com/show_bug.cgi?id=1066996


It seems that the aforementioned bug is peculiar to 3-bricks setups.

I understand that a 3-bricks setup can allow proper quorum formation without
resorting to first-configured-brick-has-more-weight convention used with
only 2 bricks and quorum auto (which makes one node special, so not
properly any-single-fault tolerant).


Correct.


But, since we are on ovirt-users, is there a similar suggested configuration
for a 2-hosts setup oVirt+GlusterFS with oVirt-side power management
properly configured and tested-working?
I mean a configuration where any host can go south and oVirt (through the
other one) fences it (forcibly powering it off with confirmation from IPMI
or similar) then restarts HA-marked vms that were running there, all the
while keeping the underlying GlusterFS-based storage domains responsive and
readable/writeable (maybe apart from a lapse between detected other-node
unresponsiveness and confirmed fencing)?


We already had a discussion with gluster asking if it was possible to
add fencing to the replica 2 quorum/consistency mechanism.

The idea is that as soon as you can't replicate a write you have to
freeze all IO until either the connection is re-established or you
know that the other host has been killed.

Adding Vijay.

There is a related thread on gluster-devel [1] to have a better
behavior in GlusterFS for prevention of split brains with sanlock and
2-way replicated gluster volumes.

Please feel free to comment on the proposal there.

Thanks,
Vijay

[1]
http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040751.html


One quick note before my main comment: I see references to quorum being
N/2 + 1.  Isn't it more accurate to say that quorum is (N + 1)/2 or
N/2 + 0.5?


(N + 1)/2 or  N/2 + 0.5 is fine when N happens to be odd. For both 
odd and even cases of N, N/2 + 1 does seem to be the more appropriate 
representation (assuming integer arithmetic).
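
To make the integer-arithmetic point concrete, here is a small illustration
using shell arithmetic (purely illustrative, not an oVirt or gluster command):

    echo $(( (2 + 1) / 2 ))   # 1 -- a lone brick of a 2-way split would "have quorum"
    echo $(( 2 / 2 + 1 ))     # 2 -- requires both bricks (hence the weighting used for replica 2)
    echo $(( (3 + 1) / 2 ))   # 2 -- same as 3/2+1 for odd N
    echo $(( 3 / 2 + 1 ))     # 2
    echo $(( (4 + 1) / 2 ))   # 2 -- two halves of a 4-node split could each claim quorum
    echo $(( 4 / 2 + 1 ))     # 3 -- strict majority for even N as well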




Now to my main comment.

I see a case that is not being addressed.  I have no proof of how often
this use-case occurs, but I believe that it does occur.  (It could
(theoretically) occur in any situation where multiple bricks are writing
to different parts of the same file.)

Use-case: sanlock via fuse client.

Steps to produce originally

(not tested for reproducibility, because I was unable to recover the
ovirt cluster after occurrence, had to rebuild from scratch), time
frame was late 2013 or early 2014

2 node ovirt cluster using replicated gluster storage
ovirt cluster up and running VMs
remove power from network switch
restore power to network switch after a few minutes

Result

both copies of .../dom_md/ids file accused the other of being out of
sync


This case would fall under the ambit of 1. Split-brains due to network 
partition or network split-brains in the proposal on gluster-devel.




Possible solutions

Thinking about it on a systems level, the only solution I can see is
to route all writes through one gluster brick.  That way all the
accusations flow from that brick to other bricks, and gluster will
find the one file

Re: [ovirt-users] sanlock + gluster recovery -- RFE

2014-05-23 Thread Vijay Bellur

On 05/21/2014 10:22 PM, Federico Simoncelli wrote:

- Original Message -

From: Giuseppe Ragusa giuseppe.rag...@hotmail.com
To: fsimo...@redhat.com
Cc: users@ovirt.org
Sent: Wednesday, May 21, 2014 5:15:30 PM
Subject: sanlock + gluster recovery -- RFE

Hi,


- Original Message -

From: Ted Miller tmiller at hcjb.org
To: users users at ovirt.org
Sent: Tuesday, May 20, 2014 11:31:42 PM
Subject: [ovirt-users] sanlock + gluster recovery -- RFE

As you are aware, there is an ongoing split-brain problem with running
sanlock on replicated gluster storage. Personally, I believe that this is
the 5th time that I have been bitten by this sanlock+gluster problem.

I believe that the following are true (if not, my entire request is probably
off base).


 * ovirt uses sanlock in such a way that when the sanlock storage is on a
   replicated gluster file system, very small storage disruptions can
   result in a gluster split-brain on the sanlock space


Although this is possible (at the moment) we are working hard to avoid it.
The hardest part here is to ensure that the gluster volume is properly
configured.

The suggested configuration for a volume to be used with ovirt is:

Volume Name: (...)
Type: Replicate
Volume ID: (...)
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
(...three bricks...)
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto

The two options ping-timeout and quorum-type are really important.

You would also need a build where this bug is fixed in order to avoid any
chance of a split-brain:

https://bugzilla.redhat.com/show_bug.cgi?id=1066996


It seems that the aforementioned bug is peculiar to 3-bricks setups.

I understand that a 3-bricks setup can allow proper quorum formation without
resorting to first-configured-brick-has-more-weight convention used with
only 2 bricks and quorum auto (which makes one node special, so not
properly any-single-fault tolerant).


Correct.


But, since we are on ovirt-users, is there a similar suggested configuration
for a 2-hosts setup oVirt+GlusterFS with oVirt-side power management
properly configured and tested-working?
I mean a configuration where any host can go south and oVirt (through the
other one) fences it (forcibly powering it off with confirmation from IPMI
or similar) then restarts HA-marked vms that were running there, all the
while keeping the underlying GlusterFS-based storage domains responsive and
readable/writeable (maybe apart from a lapse between detected other-node
unresponsiveness and confirmed fencing)?


We already had a discussion with gluster asking if it was possible to
add fencing to the replica 2 quorum/consistency mechanism.

The idea is that as soon as you can't replicate a write you have to
freeze all IO until either the connection is re-established or you
know that the other host has been killed.

Adding Vijay.




There is a related thread on gluster-devel [1] to have a better behavior 
in GlusterFS for prevention of split brains with sanlock and 2-way 
replicated gluster volumes.


Please feel free to comment on the proposal there.

Thanks,
Vijay

[1] 
http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040751.html



Re: [ovirt-users] sanlock + gluster recovery -- RFE

2014-05-23 Thread Ted Miller

Vijay, I am not a member of the developer list, so my comments are at end.

On 5/23/2014 6:55 AM, Vijay Bellur wrote:

On 05/21/2014 10:22 PM, Federico Simoncelli wrote:

- Original Message -

From: Giuseppe Ragusa giuseppe.rag...@hotmail.com
To: fsimo...@redhat.com
Cc: users@ovirt.org
Sent: Wednesday, May 21, 2014 5:15:30 PM
Subject: sanlock + gluster recovery -- RFE

Hi,


- Original Message -

From: Ted Miller tmiller at hcjb.org
To: users users at ovirt.org
Sent: Tuesday, May 20, 2014 11:31:42 PM
Subject: [ovirt-users] sanlock + gluster recovery -- RFE

As you are aware, there is an ongoing split-brain problem with running
sanlock on replicated gluster storage. Personally, I believe that this is
the 5th time that I have been bitten by this sanlock+gluster problem.

I believe that the following are true (if not, my entire request is probably
off base).


 * ovirt uses sanlock in such a way that when the sanlock storage is on a
   replicated gluster file system, very small storage disruptions can
   result in a gluster split-brain on the sanlock space


Although this is possible (at the moment) we are working hard to avoid it.
The hardest part here is to ensure that the gluster volume is properly
configured.

The suggested configuration for a volume to be used with ovirt is:

Volume Name: (...)
Type: Replicate
Volume ID: (...)
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
(...three bricks...)
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto

The two options ping-timeout and quorum-type are really important.

You would also need a build where this bug is fixed in order to avoid any
chance of a split-brain:

https://bugzilla.redhat.com/show_bug.cgi?id=1066996


It seems that the aforementioned bug is peculiar to 3-bricks setups.

I understand that a 3-bricks setup can allow proper quorum formation without
resorting to first-configured-brick-has-more-weight convention used with
only 2 bricks and quorum auto (which makes one node special, so not
properly any-single-fault tolerant).


Correct.


But, since we are on ovirt-users, is there a similar suggested configuration
for a 2-hosts setup oVirt+GlusterFS with oVirt-side power management
properly configured and tested-working?
I mean a configuration where any host can go south and oVirt (through the
other one) fences it (forcibly powering it off with confirmation from IPMI
or similar) then restarts HA-marked vms that were running there, all the
while keeping the underlying GlusterFS-based storage domains responsive and
readable/writeable (maybe apart from a lapse between detected other-node
unresponsiveness and confirmed fencing)?


We already had a discussion with gluster asking if it was possible to
add fencing to the replica 2 quorum/consistency mechanism.

The idea is that as soon as you can't replicate a write you have to
freeze all IO until either the connection is re-established or you
know that the other host has been killed.

Adding Vijay.
There is a related thread on gluster-devel [1] to have a better behavior in 
GlusterFS for prevention of split brains with sanlock and 2-way replicated 
gluster volumes.


Please feel free to comment on the proposal there.

Thanks,
Vijay

[1] http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040751.html

One quick note before my main comment: I see references to quorum being N/2 
+ 1.  Isn't it more accurate to say that quorum is (N + 1)/2 or N/2 + 0.5?


Now to my main comment.

I see a case that is not being addressed.  I have no proof of how often this 
use-case occurs, but I believe that it does occur.  (It could (theoretically) 
occur in any situation where multiple bricks are writing to different parts 
of the same file.)


Use-case: sanlock via fuse client.

Steps to produce originally

   (not tested for reproducibility, because I was unable to recover the
   ovirt cluster after occurrence, had to rebuild from scratch), time frame
   was late 2013 or early 2014

   2 node ovirt cluster using replicated gluster storage
   ovirt cluster up and running VMs
   remove power from network switch
   restore power to network switch after a few minutes

Result

   both copies of .../dom_md/ids file accused the other of being out of sync
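
   A way to confirm this kind of mutual accusation is to inspect the AFR
   changelog extended attributes of the file directly on each brick (a
   diagnostic sketch; the brick path, storage-domain UUID and volume name
   below are placeholders for the real ones):

       # run on each gluster server, as root, against the brick's copy of the file
       getfattr -d -m . -e hex /path/to/brick/<sd-uuid>/dom_md/ids
       # non-zero trusted.afr.VOLNAME-client-N counters on both bricks, each
       # recording pending operations against the other replica, are the
       # "each accuses the other" state that self-heal cannot resolve on its own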

Hypothesis of cause

   servers (ovirt nodes and gluster bricks) are called A and B
   At the moment when network communication was lost, or just a moment after
   communication was lost

   A had written to local ids file
   A had started process to send write to B
   A had not received write confirmation from B
   and
   B had written to local ids file
   B had started process to send write to A
   B had not received write confirmation from A

   Thus, each file had a segment that had been written to the local file,
   but had not been confirmed written on the remote file.  Each file
   correctly accused the other file of being out-of-sync.  I did read

Re: [ovirt-users] sanlock + gluster recovery -- RFE

2014-05-21 Thread Federico Simoncelli
- Original Message -
 From: Ted Miller tmil...@hcjb.org
 To: users users@ovirt.org
 Sent: Tuesday, May 20, 2014 11:31:42 PM
 Subject: [ovirt-users] sanlock + gluster recovery -- RFE
 
 As you are aware, there is an ongoing split-brain problem with running
 sanlock on replicated gluster storage. Personally, I believe that this is
 the 5th time that I have been bitten by this sanlock+gluster problem.
 
 I believe that the following are true (if not, my entire request is probably
 off base).
 
 
 * ovirt uses sanlock in such a way that when the sanlock storage is on a
 replicated gluster file system, very small storage disruptions can
 result in a gluster split-brain on the sanlock space

Although this is possible (at the moment) we are working hard to avoid it.
The hardest part here is to ensure that the gluster volume is properly
configured.

The suggested configuration for a volume to be used with ovirt is:

Volume Name: (...)
Type: Replicate
Volume ID: (...)
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
(...three bricks...)
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto

The two options ping-timeout and quorum-type are really important.
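
For completeness, creating a volume with that layout from scratch would look
roughly like this (host names and brick paths are made-up placeholders; a
sketch, not a tested recipe):

    gluster volume create VOLNAME replica 3 \
        host1:/export/brick1 host2:/export/brick1 host3:/export/brick1
    gluster volume set VOLNAME network.ping-timeout 10
    gluster volume set VOLNAME cluster.quorum-type auto
    gluster volume start VOLNAME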

You would also need a build where this bug is fixed in order to avoid any
chance of a split-brain:

https://bugzilla.redhat.com/show_bug.cgi?id=1066996

 How did I get into this mess?
 
 ...
 
 What I would like to see in ovirt to help me (and others like me). Alternates
 listed in order from most desirable (automatic) to least desirable (set of
 commands to type, with lots of variables to figure out).

The real solution is to avoid the split-brain altogether. At the moment it
seems that using the suggested configurations and the bug fix we shouldn't
hit a split-brain.

 1. automagic recovery
 
 2. recovery subcommand
 
 3. script
 
 4. commands

I think that the commands to resolve a split-brain should be documented.
I just started a page here:

http://www.ovirt.org/Gluster_Storage_Domain_Reference

Could you add your documentation there? Thanks!
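
As a stopgap until that page has content: the manual procedure commonly
documented for gluster at the time goes roughly as below. This is a sketch
only -- paths, VOLNAME and the gfid are placeholders, and it should be checked
against the current gluster split-brain documentation before anyone runs it:

    # 0. list the files gluster considers split-brained
    gluster volume heal VOLNAME info split-brain

    # 1. decide which brick holds the copy you trust (the "good" copy)

    # 2. on the other brick (the "bad" copy): note the gfid, then remove the
    #    file and its hard link under .glusterfs (the link path is derived from
    #    the gfid as .glusterfs/<first 2 hex chars>/<next 2>/<full gfid>)
    getfattr -n trusted.gfid -e hex /path/to/brick/<sd-uuid>/dom_md/ids
    rm /path/to/brick/<sd-uuid>/dom_md/ids
    rm /path/to/brick/.glusterfs/<xx>/<yy>/<full-gfid>

    # 3. trigger self-heal so the good copy is replicated back to the emptied brick
    gluster volume heal VOLNAME full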

-- 
Federico


[ovirt-users] sanlock + gluster recovery -- RFE

2014-05-21 Thread Giuseppe Ragusa
Hi,

 - Original Message -
  From: Ted Miller tmiller at hcjb.org
  To: users users at ovirt.org
  Sent: Tuesday, May 20, 2014 11:31:42 PM
  Subject: [ovirt-users] sanlock + gluster recovery -- RFE
  
  As you are aware, there is an ongoing split-brain problem with running
  sanlock on replicated gluster storage. Personally, I believe that this is
  the 5th time that I have been bitten by this sanlock+gluster problem.
  
  I believe that the following are true (if not, my entire request is probably
  off base).
  
  
  * ovirt uses sanlock in such a way that when the sanlock storage is on a
  replicated gluster file system, very small storage disruptions can
  result in a gluster split-brain on the sanlock space
 
 Although this is possible (at the moment) we are working hard to avoid it.
 The hardest part here is to ensure that the gluster volume is properly
 configured.
 
 The suggested configuration for a volume to be used with ovirt is:
 
 Volume Name: (...)
 Type: Replicate
 Volume ID: (...)
 Status: Started
 Number of Bricks: 1 x 3 = 3
 Transport-type: tcp
 Bricks:
 (...three bricks...)
 Options Reconfigured:
 network.ping-timeout: 10
 cluster.quorum-type: auto
 
 The two options ping-timeout and quorum-type are really important.
 
 You would also need a build where this bug is fixed in order to avoid any
 chance of a split-brain:
 
 https://bugzilla.redhat.com/show_bug.cgi?id=1066996

It seems that the aforementioned bug is peculiar to 3-bricks setups.

I understand that a 3-bricks setup can allow proper quorum formation without 
resorting to first-configured-brick-has-more-weight convention used with only 
2 bricks and quorum auto (which makes one node special, so not properly 
any-single-fault tolerant).

But, since we are on ovirt-users, is there a similar suggested configuration 
for a 2-hosts setup oVirt+GlusterFS with oVirt-side power management properly 
configured and tested-working?
I mean a configuration where any host can go south and oVirt (through the 
other one) fences it (forcibly powering it off with confirmation from IPMI or 
similar) then restarts HA-marked vms that were running there, all the while 
keeping the underlying GlusterFS-based storage domains responsive and 
readable/writeable (maybe apart from a lapse between detected other-node 
unresponsiveness and confirmed fencing)?

Furthermore: is such a suggested configuration possible in a self-hosted-engine 
scenario?

Regards,
Giuseppe

  How did I get into this mess?
  
  ...
  
  What I would like to see in ovirt to help me (and others like me). 
  Alternates
  listed in order from most desirable (automatic) to least desirable (set of
  commands to type, with lots of variables to figure out).
 
 The real solution is to avoid the split-brain altogether. At the moment it
 seems that using the suggested configurations and the bug fix we shouldn't
 hit a split-brain.
 
  1. automagic recovery
  
  2. recovery subcommand
  
  3. script
  
  4. commands
 
 I think that the commands to resolve a split-brain should be documented.
 I just started a page here:
 
 http://www.ovirt.org/Gluster_Storage_Domain_Reference
 
 Could you add your documentation there? Thanks!
 
 -- 
 Federico



Re: [ovirt-users] sanlock + gluster recovery -- RFE

2014-05-21 Thread Federico Simoncelli
- Original Message -
 From: Giuseppe Ragusa giuseppe.rag...@hotmail.com
 To: fsimo...@redhat.com
 Cc: users@ovirt.org
 Sent: Wednesday, May 21, 2014 5:15:30 PM
 Subject: sanlock + gluster recovery -- RFE
 
 Hi,
 
  - Original Message -
   From: Ted Miller tmiller at hcjb.org
   To: users users at ovirt.org
   Sent: Tuesday, May 20, 2014 11:31:42 PM
   Subject: [ovirt-users] sanlock + gluster recovery -- RFE
   
   As you are aware, there is an ongoing split-brain problem with running
   sanlock on replicated gluster storage. Personally, I believe that this is
   the 5th time that I have been bitten by this sanlock+gluster problem.
   
   I believe that the following are true (if not, my entire request is probably
   off base).
   
   
   * ovirt uses sanlock in such a way that when the sanlock storage is on a
     replicated gluster file system, very small storage disruptions can
     result in a gluster split-brain on the sanlock space
  
  Although this is possible (at the moment) we are working hard to avoid it.
  The hardest part here is to ensure that the gluster volume is properly
  configured.
  
  The suggested configuration for a volume to be used with ovirt is:
  
  Volume Name: (...)
  Type: Replicate
  Volume ID: (...)
  Status: Started
  Number of Bricks: 1 x 3 = 3
  Transport-type: tcp
  Bricks:
  (...three bricks...)
  Options Reconfigured:
  network.ping-timeout: 10
  cluster.quorum-type: auto
  
  The two options ping-timeout and quorum-type are really important.
  
  You would also need a build where this bug is fixed in order to avoid any
  chance of a split-brain:
  
  https://bugzilla.redhat.com/show_bug.cgi?id=1066996
 
 It seems that the aforementioned bug is peculiar to 3-bricks setups.
 
 I understand that a 3-bricks setup can allow proper quorum formation without
 resorting to first-configured-brick-has-more-weight convention used with
 only 2 bricks and quorum auto (which makes one node special, so not
 properly any-single-fault tolerant).

Correct.

 But, since we are on ovirt-users, is there a similar suggested configuration
 for a 2-hosts setup oVirt+GlusterFS with oVirt-side power management
 properly configured and tested-working?
 I mean a configuration where any host can go south and oVirt (through the
 other one) fences it (forcibly powering it off with confirmation from IPMI
 or similar) then restarts HA-marked vms that were running there, all the
 while keeping the underlying GlusterFS-based storage domains responsive and
 readable/writeable (maybe apart from a lapse between detected other-node
 unresponsiveness and confirmed fencing)?

We already had a discussion with gluster asking if it was possible to
add fencing to the replica 2 quorum/consistency mechanism.

The idea is that as soon as you can't replicate a write you have to
freeze all IO until either the connection is re-established or you
know that the other host has been killed.

Adding Vijay.
-- 
Federico


Re: [ovirt-users] sanlock + gluster recovery -- RFE

2014-05-21 Thread Ted Miller


On 5/21/2014 11:15 AM, Giuseppe Ragusa wrote:

Hi,

 - Original Message -
  From: Ted Miller tmiller at hcjb.org
  To: users users at ovirt.org
  Sent: Tuesday, May 20, 2014 11:31:42 PM
  Subject: [ovirt-users] sanlock + gluster recovery -- RFE
 
  As you are aware, there is an ongoing split-brain problem with running
  sanlock on replicated gluster storage. Personally, I believe that this is
  the 5th time that I have been bitten by this sanlock+gluster problem.
 
  I believe that the following are true (if not, my entire request is probably
  off base).
 
 
  * ovirt uses sanlock in such a way that when the sanlock storage is on a
    replicated gluster file system, very small storage disruptions can
    result in a gluster split-brain on the sanlock space

 Although this is possible (at the moment) we are working hard to avoid it.
 The hardest part here is to ensure that the gluster volume is properly
 configured.

 The suggested configuration for a volume to be used with ovirt is:

 Volume Name: (...)
 Type: Replicate
 Volume ID: (...)
 Status: Started
 Number of Bricks: 1 x 3 = 3
 Transport-type: tcp
 Bricks:
 (...three bricks...)
 Options Reconfigured:
 network.ping-timeout: 10
 cluster.quorum-type: auto

 The two options ping-timeout and quorum-type are really important.

 You would also need a build where this bug is fixed in order to avoid any
 chance of a split-brain:

 https://bugzilla.redhat.com/show_bug.cgi?id=1066996

It seems that the aforementioned bug is peculiar to 3-bricks setups.

I understand that a 3-bricks setup can allow proper quorum formation 
without resorting to first-configured-brick-has-more-weight convention 
used with only 2 bricks and quorum auto (which makes one node special, 
so not properly any-single-fault tolerant).


But, since we are on ovirt-users, is there a similar suggested 
configuration for a 2-hosts setup oVirt+GlusterFS with oVirt-side power 
management properly configured and tested-working?
I mean a configuration where any host can go south and oVirt (through the 
other one) fences it (forcibly powering it off with confirmation from IPMI 
or similar) then restarts HA-marked vms that were running there, all the 
while keeping the underlying GlusterFS-based storage domains responsive and 
readable/writeable (maybe apart from a lapse between detected other-node 
unresponsiveness and confirmed fencing)?


Furthermore: is such a suggested configuration possible in a 
self-hosted-engine scenario?


Regards,
Giuseppe

  How did I get into this mess?
 
  ...
 
  What I would like to see in ovirt to help me (and others like me). Alternates
  listed in order from most desirable (automatic) to least desirable (set of
  commands to type, with lots of variables to figure out).

 The real solution is to avoid the split-brain altogether. At the moment it
 seems that using the suggested configurations and the bug fix we shouldn't
 hit a split-brain.

  1. automagic recovery
 
  2. recovery subcommand
 
  3. script
 
  4. commands

 I think that the commands to resolve a split-brain should be documented.
 I just started a page here:

 http://www.ovirt.org/Gluster_Storage_Domain_Reference
I suggest you add these lines to the Gluster configuration, as I have seen 
this come up multiple times on the User list:


storage.owner-uid: 36
storage.owner-gid: 36
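
On an existing volume these can be set like any other option; 36:36 is the
vdsm:kvm uid/gid that oVirt expects to own the storage domain. VOLNAME is a
placeholder (untested sketch):

    gluster volume set VOLNAME storage.owner-uid 36
    gluster volume set VOLNAME storage.owner-gid 36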

Ted Miller
Elkhart, IN, USA



[ovirt-users] sanlock + gluster recovery -- RFE

2014-05-20 Thread Ted Miller
Itamar, I am addressing this to you because one of your assignments seems to 
be to coordinate other oVirt contributors when dealing with issues that are 
raised on the ovirt-users email list.


As you are aware, there is an ongoing split-brain problem with running 
sanlock on replicated gluster storage.  Personally, I believe that this is 
the 5th time that I have been bitten by this sanlock+gluster problem.


I believe that the following are true (if not, my entire request is probably 
off base).


 * ovirt uses sanlock in such a way that when the sanlock storage is on a
   replicated gluster file system, very small storage disruptions can result
   in a gluster split-brain on the sanlock space
     o gluster is aware of the problem, and is working on a different way of
       replicating data, which will reduce these problems.
 * most (maybe all) of the sanlock locks have a short duration, measured in
   seconds
 * there are only a couple of things that a user can safely do from the
   command line when a file is in split-brain
     o delete the file
     o rename (mv) the file

_How did I get into this mess?_

had 3 hosts running ovirt 3.3
each hosted VMs
gluster replica 3 storage
engine was external to cluster
upgraded 3 hosts from ovirt 3.3 to 3.4
hosted-engine deploy
used new gluster volume (accessed via nfs) for storage
storage was accessed using localhost:engVM1 link (localhost was 
probably a poor choice)

created new engine on VM (did not transfer any data from old engine)
added 3 hosts to new engine via web-gui
ran above setup for 3 days
shut entire system down before I left on vacation (holiday)
came back from vacation
powered on hosts
found that iptables did not have rules for gluster access
(a continuing problem if host installation is allowed to set up firewall)
added rules for gluster
glusterfs now up and running
added storage manually
tried hosted-engine --vm-start
vm did not start
logs show sanlock errors
gluster volume heal engVM1 full:
gluster volume heal engVM1 info split-brain showed 6 files in split-brain
all 5 prefixed by /rhev/data-center/mnt/localhost\:_engVM1
UUID/dom_md/ids
UUID/images/UUID/UUID (VM hard disk)
UUID/images/UUID/UUID.lease
UUID/ha_agent/hosted-engine.lockspace
UUID/ha_agent/hosted-engine.metadata
I copied each of the above files off of each of the three bricks to a safe 
place (15 files copied)

I renamed the 5 files on /rhev/
I copied the 5 files from one of the bricks to /rhev/
files can now be read OK (e.g. cat ids)
sanlock.log shows error sets like these:

2014-05-20 03:23:39-0400 36199 [2843]: s3358 lockspace 
5ebb3b40-a394-405b-bbac-4c0e21ccd659:1:/rhev/data-center/mnt/localhost:_engVM1/5ebb3b40-a394-405b-bbac-4c0e21ccd659/dom_md/ids:0
2014-05-20 03:23:39-0400 36199 [18873]: open error -5 
/rhev/data-center/mnt/localhost:_engVM1/5ebb3b40-a394-405b-bbac-4c0e21ccd659/dom_md/ids
2014-05-20 03:23:39-0400 36199 [18873]: s3358 open_disk 
/rhev/data-center/mnt/localhost:_engVM1/5ebb3b40-a394-405b-bbac-4c0e21ccd659/dom_md/ids
 error -5
2014-05-20 03:23:40-0400 36200 [2843]: s3358 add_lockspace fail result -19

I am now stuck
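
On the missing iptables rules mentioned above, rules along the following lines
are what a gluster host typically needs. Treat this as a sketch: the brick port
range depends on the GlusterFS version (49152 and up on 3.4+, 24009 and up on
3.3 and earlier) and on how many bricks the host carries, and the NFS ports
only matter if the volume is mounted over NFS:

    iptables -I INPUT -p tcp --dport 24007:24008 -j ACCEPT   # glusterd management
    iptables -I INPUT -p tcp --dport 49152:49251 -j ACCEPT   # brick ports (3.4+)
    iptables -I INPUT -p tcp --dport 111 -j ACCEPT           # portmapper
    iptables -I INPUT -p udp --dport 111 -j ACCEPT
    iptables -I INPUT -p tcp --dport 2049 -j ACCEPT          # gluster NFS
    iptables -I INPUT -p tcp --dport 38465:38467 -j ACCEPT
    service iptables save                                    # EL6-style persistence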
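
On the sanlock errors in the excerpt: open error -5 is EIO, which is what the
fuse mount returns for a file in split-brain, so sanlock cannot even open the
ids file and the add_lockspace attempts fail. Two commands that help confirm
what sanlock sees (a diagnostic sketch, using the path from the log above):

    sanlock client status     # lockspaces/resources the daemon currently holds
    sanlock direct dump /rhev/data-center/mnt/localhost:_engVM1/5ebb3b40-a394-405b-bbac-4c0e21ccd659/dom_md/ids
    # the dump will keep failing with an I/O error until the split-brain on the
    # ids file is resolved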

What I would like to see in ovirt to help me (and others like me). Alternates 
listed in order from most desirable (automatic) to least desirable (set of 
commands to type, with lots of variables to figure out).


1. automagic recovery

 * When a host is not able to access sanlock, it writes a small problem
   text file into the shared storage
     o the host-ID as part of the name (so only one host ever accesses that
       file)
     o a status number for the error causing problems
     o time stamp
     o time stamp when last sanlock lease will expire
     o if sanlock is able to access the file, the problem file is deleted
 * when time passes for its last sanlock lease to be expired, highest number
   host does a survey
     o did all other hosts create problem files?
     o do all problem files show same (or compatible) error codes related
       to file access problems?
     o are all hosts communicating by network?
     o if yes to all above:
         * delete all sanlock storage space
         * initialize sanlock from scratch
         * restart whatever may have given up because of sanlock
         * restart VM if necessary

2. recovery subcommand

 * add hosted-engine --lock-initialize command that would delete sanlock,
   start over from scratch

3. script

 * publish a script (in ovirt packages or available on web) which, when run,
   does all (or most) of the recovery process needed.

4. commands

 * publish on the web a recipe for dealing with files that commonly go
   split-brain
     o ids
     o *.lease
     o *.lockspace

Any chance of any help on any of the above levels?

Ted Miller
Elkhart, IN, USA
