[ovirt-users] Re: [Gluster-users] Re: VM disk corruption with LSM on Gluster

2019-03-27 Thread Sander Hoentjen
Hi Krutika, Leo,

Sounds promising. I will test this too, and report back tomorrow (or
maybe sooner, if corruption occurs again).

-- Sander


On 27-03-19 10:00, Krutika Dhananjay wrote:
> This is needed to prevent any inconsistencies stemming from buffered
> writes/caching file data during live VM migration.
> Besides, for Gluster to truly honor direct-io behavior in qemu's
> 'cache=none' mode (which is what oVirt uses),
> one needs to turn on performance.strict-o-direct and disable remote-dio.
>
> -Krutika
>
> On Wed, Mar 27, 2019 at 12:24 PM Leo David <leoa...@gmail.com> wrote:
>
> Hi,
> I can confirm that after setting these two options, I haven't
> encountered disk corruptions anymore.
> The downside is that, at least for me, it had a pretty big impact
> on performance.
> The IOPS really went down when running fio tests inside the VM.
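(For reference, a guest-side fio invocation along those lines; a minimal
sketch, with job parameters that are illustrative rather than the ones
Leo actually used:)

# fio --name=randwrite --ioengine=libaio --direct=1 --rw=randwrite \
    --bs=4k --size=1g --iodepth=32 --numjobs=1

With performance.strict-o-direct on, such O_DIRECT writes are honored
end-to-end instead of being absorbed by Gluster's write-behind caching,
which is presumably where the drop in IOPS shows up.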
>
> On Wed, Mar 27, 2019, 07:03 Krutika Dhananjay <kdhan...@redhat.com> wrote:
>
> Could you enable strict-o-direct and disable remote-dio on the
> src volume as well, restart the VMs on "old" and retry migration?
>
> # gluster volume set <volname> performance.strict-o-direct on
> # gluster volume set <volname> network.remote-dio off
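(A sketch for confirming the active values afterwards, assuming a
Gluster release that has the "volume get" command; <volname> is a
placeholder:)

# gluster volume get <volname> performance.strict-o-direct
# gluster volume get <volname> network.remote-dio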
>
> -Krutika
>
> On Tue, Mar 26, 2019 at 10:32 PM Sander Hoentjen
> <san...@hoentjen.eu> wrote:
>
> On 26-03-19 14:23, Sahina Bose wrote:
> > +Krutika Dhananjay and gluster ml
> >
> On Tue, Mar 26, 2019 at 6:16 PM Sander Hoentjen
> <san...@hoentjen.eu> wrote:
> >> Hello,
> >>
> >> tl;dr We have disk corruption when doing live storage migration on
> >> oVirt 4.2 with gluster 3.12.15. Any idea why?
> >>
> >> We have a 3-node oVirt cluster that is both compute and
> >> gluster-storage. The manager runs on separate hardware. We are
> >> running out of space on this volume, so we added another Gluster
> >> volume that is bigger, put a storage domain on it and then we
> >> migrated VMs to it with LSM. After some time, we noticed that
> >> (some of) the migrated VMs had corrupted filesystems. After moving
> >> everything back with export-import to the old domain where
> >> possible, and recovering from backups where needed, we set off to
> >> investigate this issue.
> >>
> >> We are now at the point where we can reproduce this issue within a
> >> day. What we have found so far:
> >> 1) The corruption occurs at the very end of the replication step,
> >> most probably between START and FINISH of diskReplicateFinish,
> >> before the START merge step
> >> 2) In the corrupted VM, at some place where data should be, this
> >> data is replaced by zeros. This can be file contents or a
> >> directory structure or whatever.
> >> 3) The source gluster volume has different settings than the
> >> destination (mostly because the defaults were different at
> >> creation time):
> >>
> >> Setting                                 old(src)  new(dst)
> >> cluster.op-version                      30800     30800 (the same)
> >> cluster.max-op-version                  31202     31202 (the same)
> >> cluster.metadata-self-heal              off       on
> >> cluster.data-self-heal                  off       on
> >> cluster.entry-self-heal                 off       on
> >> performance.low-prio-threads            16        32
> >> performance.strict-o-direct             off       on
> >> network.ping-timeout                    42        30
> >> network.remote-dio                      enable    off
> >> transport.address-family                -         inet
> >> performance.stat-prefetch               off       on
> >> features.shard-block-size               512MB     64MB
> >> cluster.shd-max-threads                 1         8
> >> cluster.shd-wait-qlength                1024      1
> >> cluster.locking-scheme                  full      granular
> >> cluster.granular-entry-heal             no        enable

[ovirt-users] Re: VM disk corruption with LSM on Gluster

2019-03-26 Thread Sander Hoentjen
On 26-03-19 14:23, Sahina Bose wrote:
> +Krutika Dhananjay and gluster ml
>
> On Tue, Mar 26, 2019 at 6:16 PM Sander Hoentjen  wrote:
>> Hello,
>>
>> tl;dr We have disk corruption when doing live storage migration on oVirt
>> 4.2 with gluster 3.12.15. Any idea why?
>>
>> We have a 3-node oVirt cluster that is both compute and gluster-storage.
>> The manager runs on separate hardware. We are running out of space on
>> this volume, so we added another Gluster volume that is bigger, put a
>> storage domain on it and then we migrated VMs to it with LSM. After
>> some time, we noticed that (some of) the migrated VMs had corrupted
>> filesystems. After moving everything back with export-import to the old
>> domain where possible, and recovering from backups where needed, we set
>> off to investigate this issue.
>>
>> We are now at the point where we can reproduce this issue within a day.
>> What we have found so far:
>> 1) The corruption occurs at the very end of the replication step, most
>> probably between START and FINISH of diskReplicateFinish, before the
>> START merge step
>> 2) In the corrupted VM, at some place where data should be, this data is
>> replaced by zeros. This can be file contents or a directory structure
>> or whatever.
>> 3) The source gluster volume has different settings than the destination
>> (Mostly because the defaults were different at creation time):
>>
>> Setting                                 old(src)  new(dst)
>> cluster.op-version                      30800     30800 (the same)
>> cluster.max-op-version                  31202     31202 (the same)
>> cluster.metadata-self-heal              off       on
>> cluster.data-self-heal                  off       on
>> cluster.entry-self-heal                 off       on
>> performance.low-prio-threads            16        32
>> performance.strict-o-direct             off       on
>> network.ping-timeout                    42        30
>> network.remote-dio                      enable    off
>> transport.address-family                -         inet
>> performance.stat-prefetch               off       on
>> features.shard-block-size               512MB     64MB
>> cluster.shd-max-threads                 1         8
>> cluster.shd-wait-qlength                1024      1
>> cluster.locking-scheme                  full      granular
>> cluster.granular-entry-heal             no        enable
>>
>> 4) To test, we migrate some VMs back and forth. The corruption does not
>> occur every time. To this point it only occurs from old to new, but we
>> don't have enough data-points to be sure about that.
>>
>> Does anybody have an idea what is causing the corruption? Is this the best list to
>> ask, or should I ask on a Gluster list? I am not sure if this is oVirt
>> specific or Gluster specific though.
> Do you have logs from old and new gluster volumes? Any errors in the
> new volume's fuse mount logs?

Around the time of corruption I see the message:
The message "I [MSGID: 133017] [shard.c:4941:shard_seek] 
0-ZoneA_Gluster1-shard: seek called on 7fabc273-3d8a-4a49-8906-b8ccbea4a49f. 
[Operation not supported]" repeated 231 times between [2019-03-26 
13:14:22.297333] and [2019-03-26 13:15:42.912170]

I also see this message at other times, though, when I don't see any
corruption occur.
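(A quick way to tally those shard messages around a migration window; a
sketch, where the path assumes the default naming of the oVirt Gluster
fuse mount logs on the hypervisor:)

# grep -c 'MSGID: 133017' \
    /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log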

-- 
Sander


[ovirt-users] VM disk corruption with LSM on Gluster

2019-03-26 Thread Sander Hoentjen
Hello,

tl;dr We have disk corruption when doing live storage migration on oVirt
4.2 with gluster 3.12.15. Any idea why?

We have a 3-node oVirt cluster that is both compute and gluster-storage.
The manager runs on separate hardware. We are running out of space on
this volume, so we added another Gluster volume that is bigger, put a
storage domain on it and then we migrated VMs to it with LSM. After
some time, we noticed that (some of) the migrated VMs had corrupted
filesystems. After moving everything back with export-import to the old
domain where possible, and recovering from backups where needed, we set
off to investigate this issue.

We are now at the point where we can reproduce this issue within a day.
What we have found so far:
1) The corruption occurs at the very end of the replication step, most
probably between START and FINISH of diskReplicateFinish, before the
START merge step
2) In the corrupted VM, at some place where data should be, this data is
replaced by zeros. This can be file contents or a directory structure
or whatever. (A direct check for this is sketched after this list.)
3) The source gluster volume has different settings than the destination
(Mostly because the defaults were different at creation time):

Setting                                 old(src)  new(dst)
cluster.op-version                      30800     30800 (the same)
cluster.max-op-version                  31202     31202 (the same)
cluster.metadata-self-heal              off       on
cluster.data-self-heal                  off       on
cluster.entry-self-heal                 off       on
performance.low-prio-threads            16        32
performance.strict-o-direct             off       on
network.ping-timeout                    42        30
network.remote-dio                      enable    off
transport.address-family                -         inet
performance.stat-prefetch               off       on
features.shard-block-size               512MB     64MB
cluster.shd-max-threads                 1         8
cluster.shd-wait-qlength                1024      1
cluster.locking-scheme                  full      granular
cluster.granular-entry-heal             no        enable

4) To test, we migrate some VMs back and forth. The corruption does not
occur every time. To this point it only occurs from old to new, but we
don't have enough data-points to be sure about that.
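(The check referenced under 2): with the VM shut down after a test
migration, the source and destination images can be compared directly.
A sketch, with purely illustrative paths:)

# qemu-img compare -f raw -F raw /path/to/old-domain/disk.img \
    /path/to/new-domain/disk.img

qemu-img compare exits 0 when the contents match and otherwise reports
the first differing offset, which should land in one of the zeroed
regions.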

Does anybody have an idea what is causing the corruption? Is this the best list to
ask, or should I ask on a Gluster list? I am not sure if this is oVirt
specific or Gluster specific though.

Kind regards,
Sander Hoentjen


Re: [ovirt-users] Ovirt/Gluster

2015-08-28 Thread Sander Hoentjen



On 08/21/2015 06:12 PM, Ravishankar N wrote:



On 08/21/2015 07:57 PM, Sander Hoentjen wrote:

Maybe I should formulate some clear questions:
1) Am I correct in assuming that an issue on one of the 3 gluster nodes
should not cause downtime for VMs on other nodes?


From what I understand, yes. Maybe the ovirt folks can confirm. I can 
tell you this much for sure: If you create a replica 3 volume using 3 
nodes, mount the volume locally on each node, and bring down one node, 
the mounts from the other 2 nodes *must* have read+write access to the 
volume.
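(A minimal sketch of that test; node names and brick paths are
hypothetical:)

# gluster volume create testvol replica 3 \
    node1:/brick/testvol node2:/brick/testvol node3:/brick/testvol
# gluster volume start testvol
# mount -t glusterfs localhost:/testvol /mnt/testvol   (on each node)

Then power off one node and, on a surviving node:

# dd if=/dev/zero of=/mnt/testvol/probe bs=1M count=1 oflag=direct

If client-quorum behaves as described, the dd on the two surviving
mounts must succeed; a 'Read-only file system' error would mean quorum
was lost.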




2) What can I/we do to fix the issue I am seeing?
3) Can anybody else reproduce my issue?

I'll try and see if I can.


Hi Ravi,

Did you get around to this by any chance? This is a blocker issue for
us. Apart from that, has anybody else had any success with using
gluster reliably as an ovirt storage solution?


Regards,
Sander


Re: [ovirt-users] Ovirt/Gluster

2015-08-21 Thread Sander Hoentjen



On 08/21/2015 11:30 AM, Ravishankar N wrote:



On 08/21/2015 01:21 PM, Sander Hoentjen wrote:



On 08/21/2015 09:28 AM, Ravishankar N wrote:



On 08/20/2015 02:14 PM, Sander Hoentjen wrote:



On 08/19/2015 09:04 AM, Ravishankar N wrote:



On 08/18/2015 04:22 PM, Ramesh Nachimuthu wrote:

+ Ravi from gluster.

Regards,
Ramesh

- Original Message -
From: Sander Hoentjen san...@hoentjen.eu
To: users@ovirt.org
Sent: Tuesday, August 18, 2015 3:30:35 PM
Subject: [ovirt-users] Ovirt/Gluster

Hi,

We are looking for some easy to manage self contained VM hosting. 
Ovirt
with GlusterFS seems to fit that bill perfectly. I installed it 
and then
started kicking the tires. First results looked promising, but now I
can get a VM to pause indefinitely fairly easily:

My setup is 3 hosts that are in a Virt and Gluster cluster. 
Gluster is
setup as replica-3. The gluster export is used as the storage 
domain for

the VM's.


Hi,

What version of gluster and ovirt are you using?

glusterfs-3.7.3-1.el7.x86_64
vdsm-4.16.20-0.el7.centos.x86_64
ovirt-engine-3.5.3.1-1.el7.centos.noarch




Now when I start the VM all is good, performance is good enough 
so we

are happy. I then start bonnie++ to generate some load. I have a VM
running on host 1, host 2 is SPM and all 3 hosts are seeing some
network traffic courtesy of gluster.

Now, for fun, suddenly the network on host3 goes bad (iptables -I OUTPUT
-m statistic --mode random --probability 0.75 -j REJECT).
Some time later I see the guest has a small hiccup, I'm guessing that
is when gluster decides host 3 is not allowed to play anymore. No big
deal anyway.
After a while 25% of packets just isn't good enough for Ovirt anymore,
so the host will be fenced.


I'm not sure what fencing means w.r.t ovirt and what it actually
fences. As far as gluster is concerned, since only one node is
blocked, the VM image should still be accessible by the VM running
on host1.
Fencing means (at least in this case) that the IPMI of the server
does a power reset.

After a reboot *sometimes* the VM will be
paused, and even after the gluster self-heal is complete it cannot be
unpaused; it has to be restarted.


Could you provide the gluster mount (fuse?) logs and the brick 
logs of all 3 nodes when the VM is paused? That should give us 
some clue.



Logs are attached. Problem was at around 8:15 - 8:20 UTC.
This time, however, the VM stopped even without a reboot of hyp03.



The mount logs (rhev-data-center-mnt-glusterSD*) indicate frequent
disconnects from the bricks, with 'clnt_ping_timer_expired',
'Client-quorum is not met' and 'Read-only file system' messages.
Client-quorum is enabled by default for replica 3 volumes. So if the
mount cannot connect to at least 2 bricks, quorum is lost and the
gluster volume becomes read-only. That seems to be the reason why
the VMs are pausing.
I'm not sure if the frequent disconnects are due to a flaky network or
the bricks not responding to the mount's ping timer because their
epoll threads are busy with I/O (unlikely). Can you also share the
output of `gluster volume info volname`?
The frequent disconnects are probably because I intentionally broke
the network on hyp03 (dropped 75% of outgoing packets). In my opinion
this should not affect the VM on hyp02. Am I wrong to think that?



For client-quorum, if a client (mount) cannot connect to the number
of bricks needed to achieve quorum, the client becomes read-only. So
if the client on hyp02 can see itself and 01, it shouldn't be affected.
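(For what it's worth, a sketch for reading back the quorum-related
settings per volume, assuming a Gluster build that has the "volume get"
command; VMS is the volume from the info output below:)

# gluster volume get VMS cluster.quorum-type
# gluster volume get VMS cluster.quorum-count
# gluster volume get VMS cluster.server-quorum-type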

But it was, and I only broke hyp03.




[root@hyp01 ~]# gluster volume info VMS

Volume Name: VMS
Type: Replicate
Volume ID: 9e6657e7-8520-4720-ba9d-78b14a86c8ca
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.99.50.20:/brick/VMS
Brick2: 10.99.50.21:/brick/VMS
Brick3: 10.99.50.22:/brick/VMS
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
user.cifs: disable
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server


I see that you have enabled server-quorum too. Since you blocked
hyp03, if the glusterd on that node cannot see the other 2 nodes due
to the iptables rules, it would kill all brick processes. See the
"How To Test" section in
http://www.gluster.org/community/documentation/index.php/Features/Server-quorum
to get a better idea of server-quorum.
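(A related sketch: server-quorum is governed by a cluster-wide ratio,
by default more than 50% of the peers. The option is global rather
than per-volume; the 51% below is just an example value:)

# gluster volume set all cluster.server-quorum-ratio 51%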


Yes but it should only kill the bricks on hyp03, right? So then why does 
the VM on hyp02 die? I don't like the fact that a problem on any one of 
the hosts can bring down any VM on any host.


--
Sander


Re: [ovirt-users] Ovirt/Gluster

2015-08-21 Thread Sander Hoentjen



On 08/21/2015 02:21 PM, Ravishankar N wrote:



On 08/21/2015 04:32 PM, Sander Hoentjen wrote:



On 08/21/2015 11:30 AM, Ravishankar N wrote:



On 08/21/2015 01:21 PM, Sander Hoentjen wrote:



On 08/21/2015 09:28 AM, Ravishankar N wrote:



On 08/20/2015 02:14 PM, Sander Hoentjen wrote:



On 08/19/2015 09:04 AM, Ravishankar N wrote:



On 08/18/2015 04:22 PM, Ramesh Nachimuthu wrote:

+ Ravi from gluster.

Regards,
Ramesh

- Original Message -
From: Sander Hoentjen san...@hoentjen.eu
To: users@ovirt.org
Sent: Tuesday, August 18, 2015 3:30:35 PM
Subject: [ovirt-users] Ovirt/Gluster

Hi,

We are looking for some easy to manage self contained VM 
hosting. Ovirt
with GlusterFS seems to fit that bill perfectly. I installed it 
and then
started kicking the tires. First results looked promising, but now I
can get a VM to pause indefinitely fairly easily:

My setup is 3 hosts that are in a Virt and Gluster cluster. 
Gluster is
setup as replica-3. The gluster export is used as the storage 
domain for

the VM's.


Hi,

What version of gluster and ovirt are you using?

glusterfs-3.7.3-1.el7.x86_64
vdsm-4.16.20-0.el7.centos.x86_64
ovirt-engine-3.5.3.1-1.el7.centos.noarch




Now when I start the VM all is good, performance is good enough 
so we
are happy. I then start bonnie++ to generate some load. I have 
a VM
running on host 1, host 2 is SPM and all 3 hosts are seeing some
network traffic courtesy of gluster.

Now, for fun, suddenly the network on host3 goes bad (iptables -I OUTPUT
-m statistic --mode random --probability 0.75 -j REJECT).
Some time later I see the guest has a small hiccup, I'm guessing that
is when gluster decides host 3 is not allowed to play anymore. No big
deal anyway.
After a while 25% of packets just isn't good enough for Ovirt anymore,
so the host will be fenced.


I'm not sure what fencing means w.r.t ovirt and what it actually 
fences. As far as gluster is concerned, since only one node is
blocked, the VM image should still be accessible by the VM 
running on host1.
Fencing means (at least in this case) that the IPMI of the server 
does a power reset.

After a reboot *sometimes* the VM will be
paused, and even after the gluster self-heal is complete it cannot be
unpaused; it has to be restarted.


Could you provide the gluster mount (fuse?) logs and the brick 
logs of all 3 nodes when the VM is paused? That should give us 
some clue.



Logs are attached. Problem was at around 8:15 - 8:20 UTC.
This time, however, the VM stopped even without a reboot of hyp03.



The mount logs (rhev-data-center-mnt-glusterSD*) indicate frequent
disconnects from the bricks, with 'clnt_ping_timer_expired',
'Client-quorum is not met' and 'Read-only file system' messages.
Client-quorum is enabled by default for replica 3 volumes. So if
the mount cannot connect to at least 2 bricks, quorum is lost and
the gluster volume becomes read-only. That seems to be the reason
why the VMs are pausing.
I'm not sure if the frequent disconnects are due to a flaky network
or the bricks not responding to the mount's ping timer because their
epoll threads are busy with I/O (unlikely). Can you also share the
output of `gluster volume info volname`?
The frequent disconnects are probably because I intentionally broke
the network on hyp03 (dropped 75% of outgoing packets). In my
opinion this should not affect the VM on hyp02. Am I wrong to think
that?



For client-quorum, if a client (mount) cannot connect to the number
of bricks needed to achieve quorum, the client becomes read-only. So
if the client on hyp02 can see itself and 01, it shouldn't be affected.

But it was, and I only broke hyp03.


Beats me then. I see "[2015-08-18 15:15:27.922998] W [MSGID: 108001]
[afr-common.c:4043:afr_notify] 0-VMS-replicate-0: Client-quorum is not
met" in hyp02's mount log, but the timestamp is earlier than when you
say you observed the hang (2015-08-20, around 8:15 - 8:20 UTC?).
(They do occur in that window on hyp03, though.)
Yeah that event is from before. For your information: This setup is used 
to test, so I try to break it and hope I don't succeed. Unfortunately I 
succeeded.






[root@hyp01 ~]# gluster volume info VMS

Volume Name: VMS
Type: Replicate
Volume ID: 9e6657e7-8520-4720-ba9d-78b14a86c8ca
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.99.50.20:/brick/VMS
Brick2: 10.99.50.21:/brick/VMS
Brick3: 10.99.50.22:/brick/VMS
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
user.cifs: disable
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server


I see that you have enabled server-quorum too. Since you blocked
hyp03, if the glusterd on that node cannot see the other 2
nodes due to the iptables rules, it would kill all brick processes. See
the "How To Test" section in
http://www.gluster.org/community/documentation/index.php/Features/Server-quorum

Re: [ovirt-users] Ovirt/Gluster

2015-08-21 Thread Sander Hoentjen



On 08/21/2015 09:28 AM, Ravishankar N wrote:



On 08/20/2015 02:14 PM, Sander Hoentjen wrote:



On 08/19/2015 09:04 AM, Ravishankar N wrote:



On 08/18/2015 04:22 PM, Ramesh Nachimuthu wrote:

+ Ravi from gluster.

Regards,
Ramesh

- Original Message -
From: Sander Hoentjen san...@hoentjen.eu
To: users@ovirt.org
Sent: Tuesday, August 18, 2015 3:30:35 PM
Subject: [ovirt-users] Ovirt/Gluster

Hi,

We are looking for some easy to manage self contained VM hosting. 
Ovirt
with GlusterFS seems to fit that bill perfectly. I installed it and 
then

started kicking the tires. First results looked promising, but now I
can get a VM to pause indefinitely fairly easily:

My setup is 3 hosts that are in a Virt and Gluster cluster. Gluster is
setup as replica-3. The gluster export is used as the storage 
domain for

the VM's.


Hi,

What version of gluster and ovirt are you using?

glusterfs-3.7.3-1.el7.x86_64
vdsm-4.16.20-0.el7.centos.x86_64
ovirt-engine-3.5.3.1-1.el7.centos.noarch




Now when I start the VM all is good, performance is good enough so we
are happy. I then start bonnie++ to generate some load. I have a VM
running on host 1, host 2 is SPM and all 3 hosts are seeing some
network traffic courtesy of gluster.

Now, for fun, suddenly the network on host3 goes bad (iptables -I OUTPUT
-m statistic --mode random --probability 0.75 -j REJECT).
Some time later I see the guest has a small hiccup, I'm guessing that
is when gluster decides host 3 is not allowed to play anymore. No big
deal anyway.
After a while 25% of packets just isn't good enough for Ovirt anymore,
so the host will be fenced.


I'm not sure what fencing means w.r.t ovirt and what it actually
fences. As far as gluster is concerned, since only one node is
blocked, the VM image should still be accessible by the VM running
on host1.
Fencing means (at least in this case) that the IPMI of the server 
does a power reset.

After a reboot *sometimes* the VM will be
paused, and even after the gluster self-heal is complete it cannot be
unpaused; it has to be restarted.


Could you provide the gluster mount (fuse?) logs and the brick logs 
of all 3 nodes when the VM is paused? That should give us some clue.



Logs are attached. Problem was at around 8:15 - 8:20 UTC.
This time, however, the VM stopped even without a reboot of hyp03.



The mount logs (rhev-data-center-mnt-glusterSD*) indicate frequent
disconnects from the bricks, with 'clnt_ping_timer_expired',
'Client-quorum is not met' and 'Read-only file system' messages.
Client-quorum is enabled by default for replica 3 volumes. So if the
mount cannot connect to at least 2 bricks, quorum is lost and the
gluster volume becomes read-only. That seems to be the reason why the
VMs are pausing.
I'm not sure if the frequent disconnects are due to a flaky network or
the bricks not responding to the mount's ping timer because their epoll
threads are busy with I/O (unlikely). Can you also share the output of
`gluster volume info volname`?
The frequent disconnects are probably because I intentionally broke the
network on hyp03 (dropped 75% of outgoing packets). In my opinion this
should not affect the VM on hyp02. Am I wrong to think that?


[root@hyp01 ~]# gluster volume info VMS

Volume Name: VMS
Type: Replicate
Volume ID: 9e6657e7-8520-4720-ba9d-78b14a86c8ca
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.99.50.20:/brick/VMS
Brick2: 10.99.50.21:/brick/VMS
Brick3: 10.99.50.22:/brick/VMS
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
user.cifs: disable
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36

--
Sander


[ovirt-users] Ovirt/Gluster

2015-08-18 Thread Sander Hoentjen

Hi,

We are looking for some easy to manage self contained VM hosting. Ovirt 
with GlusterFS seems to fit that bill perfectly. I installed it and then 
started kicking the tires. First results looked promising, but now I
can get a VM to pause indefinitely fairly easily:


My setup is 3 hosts that are in a Virt and Gluster cluster. Gluster is 
setup as replica-3. The gluster export is used as the storage domain for 
the VMs.


Now when I start the VM all is good, performance is good enough so we 
are happy. I then start bonnie++ to generate some load. I have a VM 
running on host 1, host 2 is SPM and all 3 hosts are seeing some network
traffic courtesy of gluster.


Now, for fun, suddenly the network on host3 goes bad (iptables -I OUTPUT 
-m statistic --mode random --probability 0.75 -j REJECT).
Some time later I see the guest has a small hiccup, I'm guessing that
is when gluster decides host 3 is not allowed to play anymore. No big
deal anyway.
After a while 25% of packets just isn't good enough for Ovirt anymore,
so the host will be fenced. After a reboot *sometimes* the VM will be
paused, and even after the gluster self-heal is complete it cannot be
unpaused; it has to be restarted.
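(For completeness, a sketch of the fault-injection round trip used in
this test; the -D form removes exactly the rule that was inserted, and
the heal command is one way to watch recovery afterwards, with VMS
being the volume name used here:)

# iptables -I OUTPUT -m statistic --mode random --probability 0.75 -j REJECT
  ... observe the behaviour, then restore ...
# iptables -D OUTPUT -m statistic --mode random --probability 0.75 -j REJECT
# gluster volume heal VMS info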


Is there anything I can do to prevent the VM from being paused?

Regards,
Sander
