[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-05-02 Thread olaf . buitelaar
Sorry, it appears the messages about "Get Host Statistics failed: Internal 
JSON-RPC error: {'reason': '[Errno 19] veth18ae509 is not present in the 
system'}" aren't gone; they just happen much less frequently.

Best Olaf


[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-04-29 Thread olaf . buitelaar
Dear Mohit,

I've upgraded to gluster 5.6, however the starting of multiple glusterfsd 
processes per brick doesn't seem to be fully resolved yet, although it does 
seem to happen less often than before. Also, in some cases glusterd did detect 
that a glusterfsd was already running, but decided it was not valid. It was 
reproducible on all my machines after a reboot, but only a few bricks seemed 
to be affected. I'm running about 14 bricks per machine, and only 1 - 3 were 
affected; the ones with 3 full bricks seemed to suffer most. Also, in some 
cases a restart of the glusterd service did spawn multiple glusterfsd 
processes for the same bricks configured on the node. 

See for example these logs:
[2019-04-19 17:49:50.853099] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 17:50:33.302239] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 17:56:11.287692] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 17:57:12.699967] I [glusterd-utils.c:6184:glusterd_brick_start] 
0-management: Either pid 14884 is not running or brick path 
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-19 17:57:12.700150] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 18:02:58.420870] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 18:03:29.420891] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 18:48:14.046029] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 18:55:04.508606] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core

or

[2019-04-18 17:00:00.665476] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:00:32.799529] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:02:38.271880] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:08:32.867046] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:09:00.440336] I [glusterd-utils.c:6184:glusterd_brick_start] 
0-management: Either pid 9278 is not running or brick path 
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-18 17:09:00.440476] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:09:07.644070] I [glusterd-utils.c:6184:glusterd_brick_start] 
0-management: Either pid 24126 is not running or brick path 
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-18 17:09:07.644184] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:09:13.785798] I [glusterd-utils.c:6184:glusterd_brick_start] 
0-management: Either pid 27197 is not running or brick path 
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-18 17:09:13.785918] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:09:24.344561] I [glusterd-utils.c:6184:glusterd_brick_start] 
0-management: Either pid 28468 is not running or brick path 
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-18 17:09:24.344675] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:37:07.150799] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 18:17:23.203719] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core

Again the procedure to resolve this was to kill all the glusterfsd processes 
for the brick and run 'gluster v start <volname> force', which resulted in 
only one process being started.
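
For reference, this is roughly the per-brick check/recovery sequence I use (a 
sketch; <volname> and the brick path are placeholders, adjust to your setup):

# list every glusterfsd process serving one brick path
ps -eo pid,lstart,cmd | grep '[g]lusterfsd' | grep '/data/gfs/bricks/brick1/ovirt-core'
# if more than one shows up, kill them all for that brick
pkill -f 'glusterfsd.*data-gfs-bricks-brick1-ovirt-core'
# let glusterd spawn a single fresh brick process
gluster v start <volname> force
# verify only one process remains and the brick is back online
gluster v status <volname>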

After the upgrade to 5.6 I do notice a small performance improvement of around 
15%, but it's still far from 3.12.15.

[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-04-07 Thread Mohit Agrawal
Hi,

Thanks Olaf for sharing the relevant logs.

@Atin,
You are right, the patch https://review.gluster.org/#/c/glusterfs/+/22344/ will
resolve the issue of multiple brick instances running for the same brick.

As we can see in the logs below, glusterd is trying to start the same brick
instance twice at the same time:

[2019-04-01 10:23:21.752401] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-engine
[2019-04-01 10:23:30.348091] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-engine
[2019-04-01 10:24:13.353396] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-engine
[2019-04-01 10:24:24.253764] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-engine

We are seeing the below message between the starting of the two instances:
The message "E [MSGID: 101012] [common-utils.c:4075:gf_is_service_running]
0-: Unable to read pidfile:
/var/run/gluster/vols/ovirt-engine/10.32.9.5-data-gfs-bricks-brick1-ovirt-engine.pid"
repeated 2 times between [2019-04-01 10:23:21.748492] and [2019-04-01
10:23:21.752432]

I will backport the same.
Thanks,
Mohit Agrawal

On Wed, Apr 3, 2019 at 3:58 PM Olaf Buitelaar 
wrote:

> Dear Mohit,
>
> Sorry i thought Krutika was referring to the ovirt-kube brick logs. due
> the large size (18MB compressed), i've placed the files here;
> https://edgecastcdn.net/0004FA/files/bricklogs.tar.bz2
> Also i see i've attached the wrong files, i intended to
> attach profile_data4.txt | profile_data3.txt
> Sorry for the confusion.
>
> Thanks Olaf
>
> On Wed 3 Apr 2019 at 04:56, Mohit Agrawal wrote:
>
>> Hi Olaf,
>>
>>   As per current attached "multi-glusterfsd-vol3.txt |
>> multi-glusterfsd-vol4.txt" it is showing multiple processes are running
>>   for "ovirt-core ovirt-engine" brick names but there are no logs
>> available in bricklogs.zip specific to this bricks, bricklogs.zip
>>   has a dump of ovirt-kube logs only
>>
>>   Kindly share brick logs specific to the bricks "ovirt-core
>> ovirt-engine" and share glusterd logs also.
>>
>> Regards
>> Mohit Agrawal
>>
>> On Tue, Apr 2, 2019 at 9:18 PM Olaf Buitelaar 
>> wrote:
>>
>>> Dear Krutika,
>>>
>>> 1.
>>> I've changed the volume settings, write performance seems to increased
>>> somewhat, however the profile doesn't really support that since latencies
>>> increased. However read performance has diminished, which does seem to be
>>> supported by the profile runs (attached).
>>> Also the IO does seem to behave more consistent than before.
>>> I don't really understand the idea behind them, maybe you can explain
>>> why these suggestions are good?
>>> These settings seems to avoid as much local caching and access as
>>> possible and push everything to the gluster processes. While i would expect
>>> local access and local caches are a good thing, since it would lead to
>>> having less network access or disk access.
>>> I tried to investigate these settings a bit more, and this is what i
>>> understood of them;
>>> - network.remote-dio; when on it seems to ignore the O_DIRECT flag in
>>> the client, thus causing the files to be cached and buffered in the page
>>> cache on the client, i would expect this to be a good thing especially if
>>> the server process would access the same page cache?
>>> At least that is what grasp from this commit;
>>> https://review.gluster.org/#/c/glusterfs/+/4206/2/xlators/protocol/client/src/client.c
>>>  line
>>> 867
>>> Also found this commit;
>>> https://github.com/gluster/glusterfs/commit/06c4ba589102bf92c58cd9fba5c60064bc7a504e#diff-938709e499b4383c3ed33c3979b9080c
>>>  suggesting
>>> remote-dio actually improves performance, not sure it's a write or read
>>> benchmark
>>> When a file is opened with O_DIRECT it will also disable the
>>> write-behind functionality
>>>
>>> - performance.strict-o-direct: when on, the AFR, will not ignore the
>>> O_DIRECT flag. and will invoke: fop_writev_stub with the wb_writev_helper,
>>> which seems to stack the operation, no idea why that is. But generally i
>>> suppose not ignoring the O_DIRECT flag in the AFR is a good thing, when a
>>> processes requests to have O_DIRECT. So this makes sense to me.
>>>
>>> - cluster.choose-local: when off, it doesn't prefer the local node, but
>>> would always choose a brick. Since it's a 9 node cluster, with 3
>>> subvolumes, only a 1/3 could end-up local, and the other 2/3 should be
>>> pushed to external nodes anyway. Or am I making the total wrong assumption
>>> here?
>>>
>>> It seems to this config is moving to the gluster-block config side of
>>> things, which does make sense.
>>> Since we're running quite some mysql instances, which opens the files
>>> with O_DIRECt i believe, it would mean the only layer of cache is within
>>> mysql it self. 

[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-04-03 Thread Olaf Buitelaar
Dear  Mohit,

Thanks for backporting this issue. Hopefully we can address the others as
well; if I can do anything, let me know.
On my side I've tested with 'gluster volume reset <volname>
cluster.choose-local', but haven't really noticed a change in performance.
On the good side, the brick processes didn't crash when updating this
config.
I'll experiment with the other changes as well, and see how the
combinations affect performance.
I also saw this commit: https://review.gluster.org/#/c/glusterfs/+/21333/
which looks very useful; will this be a recommended option for VM/block
workloads?
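
In case it helps others following along, this is the rough sequence I use to
flip the options between test runs (a sketch; <volname> is a placeholder and
the values follow the suggestions from earlier in this thread):

# apply the suggested direct-io related settings
gluster volume set <volname> network.remote-dio off
gluster volume set <volname> performance.strict-o-direct on
gluster volume set <volname> cluster.choose-local off
# ...run the workload / volume profile here...
# revert a single option to its default to compare, e.g. choose-local
gluster volume reset <volname> cluster.choose-local
# check which value is currently in effect
gluster volume get <volname> cluster.choose-local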

Thanks Olaf


On Wed 3 Apr 2019 at 17:56, Mohit Agrawal wrote:

>
> Hi,
>
> Thanks Olaf for sharing the relevant logs.
>
> @Atin,
> You are right patch https://review.gluster.org/#/c/glusterfs/+/22344/
> will resolve the issue running multiple brick instance for same brick.
>
> As we can see in below logs glusterd is trying to start the same brick
> instance twice at the same time
>
> [2019-04-01 10:23:21.752401] I
> [glusterd-utils.c:6301:glusterd_brick_start] 0-management: starting a fresh
> brick process for brick /data/gfs/bricks/brick1/ovirt-engine
> [2019-04-01 10:23:30.348091] I
> [glusterd-utils.c:6301:glusterd_brick_start] 0-management: starting a fresh
> brick process for brick /data/gfs/bricks/brick1/ovirt-engine
> [2019-04-01 10:24:13.353396] I
> [glusterd-utils.c:6301:glusterd_brick_start] 0-management: starting a fresh
> brick process for brick /data/gfs/bricks/brick1/ovirt-engine
> [2019-04-01 10:24:24.253764] I
> [glusterd-utils.c:6301:glusterd_brick_start] 0-management: starting a fresh
> brick process for brick /data/gfs/bricks/brick1/ovirt-engine
>
> We are seeing below message between starting of two instances
> The message "E [MSGID: 101012] [common-utils.c:4075:gf_is_service_running]
> 0-: Unable to read pidfile:
> /var/run/gluster/vols/ovirt-engine/10.32.9.5-data-gfs-bricks-brick1-ovirt-engine.pid"
> repeated 2 times between [2019-04-01 10:23:21.748492] and [2019-04-01
> 10:23:21.752432]
>
> I will backport the same.
> Thanks,
> Mohit Agrawal
>
> On Wed, Apr 3, 2019 at 3:58 PM Olaf Buitelaar 
> wrote:
>
>> Dear Mohit,
>>
>> Sorry i thought Krutika was referring to the ovirt-kube brick logs. due
>> the large size (18MB compressed), i've placed the files here;
>> https://edgecastcdn.net/0004FA/files/bricklogs.tar.bz2
>> Also i see i've attached the wrong files, i intended to
>> attach profile_data4.txt | profile_data3.txt
>> Sorry for the confusion.
>>
>> Thanks Olaf
>>
>> On Wed 3 Apr 2019 at 04:56, Mohit Agrawal wrote:
>>
>>> Hi Olaf,
>>>
>>>   As per current attached "multi-glusterfsd-vol3.txt |
>>> multi-glusterfsd-vol4.txt" it is showing multiple processes are running
>>>   for "ovirt-core ovirt-engine" brick names but there are no logs
>>> available in bricklogs.zip specific to this bricks, bricklogs.zip
>>>   has a dump of ovirt-kube logs only
>>>
>>>   Kindly share brick logs specific to the bricks "ovirt-core
>>> ovirt-engine" and share glusterd logs also.
>>>
>>> Regards
>>> Mohit Agrawal
>>>
>>> On Tue, Apr 2, 2019 at 9:18 PM Olaf Buitelaar 
>>> wrote:
>>>
 Dear Krutika,

 1.
 I've changed the volume settings, write performance seems to increased
 somewhat, however the profile doesn't really support that since latencies
 increased. However read performance has diminished, which does seem to be
 supported by the profile runs (attached).
 Also the IO does seem to behave more consistent than before.
 I don't really understand the idea behind them, maybe you can explain
 why these suggestions are good?
 These settings seems to avoid as much local caching and access as
 possible and push everything to the gluster processes. While i would expect
 local access and local caches are a good thing, since it would lead to
 having less network access or disk access.
 I tried to investigate these settings a bit more, and this is what i
 understood of them;
 - network.remote-dio; when on it seems to ignore the O_DIRECT flag in
 the client, thus causing the files to be cached and buffered in the page
 cache on the client, i would expect this to be a good thing especially if
 the server process would access the same page cache?
 At least that is what grasp from this commit;
 https://review.gluster.org/#/c/glusterfs/+/4206/2/xlators/protocol/client/src/client.c
  line
 867
 Also found this commit;
 https://github.com/gluster/glusterfs/commit/06c4ba589102bf92c58cd9fba5c60064bc7a504e#diff-938709e499b4383c3ed33c3979b9080c
  suggesting
 remote-dio actually improves performance, not sure it's a write or read
 benchmark
 When a file is opened with O_DIRECT it will also disable the
 write-behind functionality

 - performance.strict-o-direct: when on, the AFR, will not ignore the
 O_DIRECT flag. and will invoke: fop_writev_stub with the wb_writev_helper,
 which seems to stack the 

[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-04-03 Thread Mohit Agrawal
Hi Olaf,

  As per the currently attached "multi-glusterfsd-vol3.txt |
multi-glusterfsd-vol4.txt", multiple processes are running for the
  "ovirt-core" and "ovirt-engine" brick names, but there are no logs available
in bricklogs.zip specific to these bricks; bricklogs.zip
  has a dump of the ovirt-kube logs only.

  Kindly share the brick logs specific to the bricks "ovirt-core" and
"ovirt-engine", and also share the glusterd logs.
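
  For reference, on each node those logs should be at (assuming the default
log directory; the brick log file name is derived from the brick path):

/var/log/glusterfs/glusterd.log
/var/log/glusterfs/bricks/data-gfs-bricks-brick1-ovirt-core.log
/var/log/glusterfs/bricks/data-gfs-bricks-brick1-ovirt-engine.log

# e.g. collect them into one archive per node (a sketch)
tar -cjf gluster-logs-$(hostname).tar.bz2 /var/log/glusterfs/glusterd.log /var/log/glusterfs/bricks/*ovirt-core.log /var/log/glusterfs/bricks/*ovirt-engine.log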

Regards
Mohit Agrawal

On Tue, Apr 2, 2019 at 9:18 PM Olaf Buitelaar 
wrote:

> Dear Krutika,
>
> 1.
> I've changed the volume settings; write performance seems to have increased
> somewhat, however the profile doesn't really support that since latencies
> increased. However, read performance has diminished, which does seem to be
> supported by the profile runs (attached).
> Also the IO does seem to behave more consistent than before.
> I don't really understand the idea behind them, maybe you can explain why
> these suggestions are good?
> These settings seem to avoid as much local caching and access as possible
> and push everything to the gluster processes. While I would expect local
> access and local caches to be a good thing, since it would lead to having
> less network access or disk access.
> I tried to investigate these settings a bit more, and this is what I
> understood of them;
> - network.remote-dio; when on, it seems to ignore the O_DIRECT flag in the
> client, thus causing the files to be cached and buffered in the page cache
> on the client. I would expect this to be a good thing, especially if the
> server process would access the same page cache?
> At least that is what I grasp from this commit;
> https://review.gluster.org/#/c/glusterfs/+/4206/2/xlators/protocol/client/src/client.c
>  line
> 867
> Also found this commit;
> https://github.com/gluster/glusterfs/commit/06c4ba589102bf92c58cd9fba5c60064bc7a504e#diff-938709e499b4383c3ed33c3979b9080c
>  suggesting
> remote-dio actually improves performance, not sure it's a write or read
> benchmark
> When a file is opened with O_DIRECT it will also disable the write-behind
> functionality
>
> - performance.strict-o-direct: when on, the AFR will not ignore the
> O_DIRECT flag, and will invoke fop_writev_stub with the wb_writev_helper,
> which seems to stack the operation; no idea why that is. But generally I
> suppose not ignoring the O_DIRECT flag in the AFR is a good thing when a
> process requests O_DIRECT. So this makes sense to me.
>
> - cluster.choose-local: when off, it doesn't prefer the local node, but
> would always choose a brick. Since it's a 9-node cluster with 3
> subvolumes, only 1/3 could end up local, and the other 2/3 should be
> pushed to external nodes anyway. Or am I making a totally wrong assumption
> here?
>
> It seems this config is moving to the gluster-block config side of
> things, which does make sense.
> Since we're running quite some mysql instances, which open the files with
> O_DIRECT I believe, it would mean the only layer of cache is within mysql
> itself. Which you could argue is a good thing. But I would expect a little
> write-behind buffer, and maybe some of the data cached within gluster,
> would alleviate things a bit on gluster's side. But I wouldn't know if
> that's the correct mindset, and so might be totally off here.
> Also I would expect these 'gluster v set' commands to be online
> operations, but somehow the bricks went down after applying these changes.
> What appears to have happened is that after the update the brick process
> was restarted, but due to the multiple brick process start issue, multiple
> processes were started, and the brick didn't come online again.
> However I'll try to reproduce this, since I would like to test with
> cluster.choose-local: on, and see how performance compares, and hopefully
> collect some useful info when it occurs.
> Question: are network.remote-dio and performance.strict-o-direct mutually
> exclusive settings, or can they both be on?
>
> 2. I've attached all brick logs; the only thing relevant I found was:
> [2019-03-28 20:20:07.170452] I [MSGID: 113030]
> [posix-entry-ops.c:1146:posix_unlink] 0-ovirt-kube-posix:
> open-fd-key-status: 0 for
> /data/gfs/bricks/brick1/ovirt-kube/.shard/a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.109886
> [2019-03-28 20:20:07.170491] I [MSGID: 113031]
> [posix-entry-ops.c:1053:posix_skip_non_linkto_unlink] 0-posix: linkto_xattr
> status: 0 for
> /data/gfs/bricks/brick1/ovirt-kube/.shard/a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.109886
> [2019-03-28 20:20:07.248480] I [MSGID: 113030]
> [posix-entry-ops.c:1146:posix_unlink] 0-ovirt-kube-posix:
> open-fd-key-status: 0 for
> /data/gfs/bricks/brick1/ovirt-kube/.shard/a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.109886
> [2019-03-28 20:20:07.248491] I [MSGID: 113031]
> [posix-entry-ops.c:1053:posix_skip_non_linkto_unlink] 0-posix: linkto_xattr
> status: 0 for
> /data/gfs/bricks/brick1/ovirt-kube/.shard/a38d64bc-a28b-4ee1-a0bb-f919e7a1022c.109886
>
> Thanks Olaf
>
> ps. sorry needed to 

[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-03-31 Thread Krutika Dhananjay
Adding back gluster-users
Comments inline ...

On Fri, Mar 29, 2019 at 8:11 PM Olaf Buitelaar 
wrote:

> Dear Krutika,
>
>
>
> 1. I’ve made 2 profile runs of around 10 minutes (see files
> profile_data.txt and profile_data2.txt). Looking at it, most time seems be
> spent at the  fop’s fsync and readdirp.
>
> Unfortunate I don’t have the profile info for the 3.12.15 version so it’s
> a bit hard to compare.
>
> One additional thing I do notice on 1 machine (10.32.9.5) the iowait time
> increased a lot, from an average below the 1% it’s now around the 12% after
> the upgrade.
>
> So first suspicion with be lighting strikes twice, and I’ve also just now
> a bad disk, but that doesn’t appear to be the case, since all smart status
> report ok.
>
> Also dd shows performance I would more or less expect;
>
> dd if=/dev/zero of=/data/test_file  bs=100M count=1  oflag=dsync
>
> 1+0 records in
>
> 1+0 records out
>
> 104857600 bytes (105 MB) copied, 0.686088 s, 153 MB/s
>
> dd if=/dev/zero of=/data/test_file  bs=1G count=1  oflag=dsync
>
> 1+0 records in
>
> 1+0 records out
>
> 1073741824 bytes (1.1 GB) copied, 7.61138 s, 141 MB/s
>
> dd if=/dev/urandom of=/data/test_file  bs=1024 count=1000000
>
> 1000000+0 records in
>
> 1000000+0 records out
>
> 1024000000 bytes (1.0 GB) copied, 6.35051 s, 161 MB/s
>
> dd if=/dev/zero of=/data/test_file  bs=1024 count=1000000
>
> 1000000+0 records in
>
> 1000000+0 records out
>
> 1024000000 bytes (1.0 GB) copied, 1.6899 s, 606 MB/s
>
> When I disable this brick (service glusterd stop; pkill glusterfsd)
> performance in gluster is better, but not on par with what it was. Also the
> cpu usages on the “neighbor” nodes which hosts the other bricks in the same
> subvolume increases quite a lot in this case, which I wouldn’t expect
> actually since they shouldn't handle much more work, except flagging shards
> to heal. Iowait  also goes to idle once gluster is stopped, so it’s for
> sure gluster which waits for io.
>
>
>

So I see that FSYNC %-latency is on the higher side. And I also noticed you
don't have direct-io options enabled on the volume.
Could you set the following options on the volume -
# gluster volume set <volname> network.remote-dio off
# gluster volume set <volname> performance.strict-o-direct on
and also disable choose-local
# gluster volume set <volname> cluster.choose-local off

Let me know if this helps.
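
To confirm the options took effect you can query them back afterwards (a
sketch, assuming the same <volname> placeholder and that your gluster version
has the 'volume get' CLI, which 5.x should):
# gluster volume get <volname> network.remote-dio
# gluster volume get <volname> performance.strict-o-direct
# gluster volume get <volname> cluster.choose-local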

> 2. I’ve attached the mnt log and volume info, but I couldn’t find anything
> relevant in those logs. I think this is because we run the VM’s with
> libgfapi;
>
> [root@ovirt-host-01 ~]# engine-config  -g LibgfApiSupported
>
> LibgfApiSupported: true version: 4.2
>
> LibgfApiSupported: true version: 4.1
>
> LibgfApiSupported: true version: 4.3
>
> And I can confirm the qemu process is invoked with the gluster:// address
> for the images.
>
> The message is logged in the /var/lib/libvirt/qemu/  file, which
> I’ve also included. For a sample case see around; 2019-03-28 20:20:07
>
> Which has the error; E [MSGID: 133010]
> [shard.c:2294:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on
> shard 109886 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c
> [Stale file handle]
>

Could you also attach the brick logs for this volume?


>
> 3. yes I see multiple instances for the same brick directory, like;
>
> /usr/sbin/glusterfsd -s 10.32.9.6 --volfile-id
> ovirt-core.10.32.9.6.data-gfs-bricks-brick1-ovirt-core -p
> /var/run/gluster/vols/ovirt-core/10.32.9.6-data-gfs-bricks-brick1-ovirt-core.pid
> -S /var/run/gluster/452591c9165945d9.socket --brick-name
> /data/gfs/bricks/brick1/ovirt-core -l
> /var/log/glusterfs/bricks/data-gfs-bricks-brick1-ovirt-core.log
> --xlator-option *-posix.glusterd-uuid=fb513da6-f3bd-4571-b8a2-db5efaf60cc1
> --process-name brick --brick-port 49154 --xlator-option
> ovirt-core-server.listen-port=49154
>
>
>
> I’ve made an export of the output of ps from the time I observed these
> multiple processes.
>
> In addition the brick_mux bug as noted by Atin. I might also have another
> possible cause, as ovirt moves nodes from none-operational state or
> maintenance state to active/activating, it also seems to restart gluster,
> however I don’t have direct proof for this theory.
>
>
>

+Atin Mukherjee  ^^
+Mohit Agrawal   ^^

-Krutika

Thanks Olaf
>
> On Fri 29 Mar 2019 at 10:03, Sandro Bonazzola wrote:
>
>>
>>
>> On Thu 28 Mar 2019 at 17:48,  wrote:
>>
>>> Dear All,
>>>
>>> I wanted to share my experience upgrading from 4.2.8 to 4.3.1. While
>>> previous upgrades from 4.1 to 4.2 etc. went rather smooth, this one was a
>>> different experience. After first trying a test upgrade on a 3 node setup,
>>> which went fine. i headed to upgrade the 9 node production platform,
>>> unaware of the backward compatibility issues between gluster 3.12.15 ->
>>> 5.3. After upgrading 2 nodes, the HA engine stopped and wouldn't start.
>>> Vdsm wasn't able to mount the engine storage domain, since /dom_md/metadata
>>> was missing or couldn't be accessed. 

[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-03-29 Thread Darrell Budic
I’ve also encounter multiple brick processes (glusterfsd) being spawned per 
brick directory on gluster 5.5 while upgrading from 3.12.15. In my case, it’s 
on a stand alone server cluster that doesn’t have ovirt installed, so it seems 
to be gluster itself. 

Haven’t had the chance to followup on some bug reports yet, but hopefully in 
the next day or so...

> On Mar 29, 2019, at 9:39 AM, Olaf Buitelaar  wrote:
> 
> Dear Krutika,
>  
> 1. I’ve made 2 profile runs of around 10 minutes (see files profile_data.txt 
> and profile_data2.txt). Looking at it, most time seems be spent at the  fop’s 
> fsync and readdirp.
> Unfortunate I don’t have the profile info for the 3.12.15 version so it’s a 
> bit hard to compare.
> One additional thing I do notice on 1 machine (10.32.9.5) the iowait time 
> increased a lot, from an average below the 1% it’s now around the 12% after 
> the upgrade.
> So first suspicion with be lighting strikes twice, and I’ve also just now a 
> bad disk, but that doesn’t appear to be the case, since all smart status 
> report ok.
> Also dd shows performance I would more or less expect;
> dd if=/dev/zero of=/data/test_file  bs=100M count=1  oflag=dsync
> 1+0 records in
> 1+0 records out
> 104857600 bytes (105 MB) copied, 0.686088 s, 153 MB/s
> dd if=/dev/zero of=/data/test_file  bs=1G count=1  oflag=dsync
> 1+0 records in
> 1+0 records out
> 1073741824 bytes (1.1 GB) copied, 7.61138 s, 141 MB/s
> dd if=/dev/urandom of=/data/test_file  bs=1024 count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 1024000000 bytes (1.0 GB) copied, 6.35051 s, 161 MB/s
> dd if=/dev/zero of=/data/test_file  bs=1024 count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 1024000000 bytes (1.0 GB) copied, 1.6899 s, 606 MB/s
> When I disable this brick (service glusterd stop; pkill glusterfsd) 
> performance in gluster is better, but not on par with what it was. Also the 
> cpu usages on the “neighbor” nodes which hosts the other bricks in the same 
> subvolume increases quite a lot in this case, which I wouldn’t expect 
> actually since they shouldn't handle much more work, except flagging shards 
> to heal. Iowait  also goes to idle once gluster is stopped, so it’s for sure 
> gluster which waits for io.
>  
> 2. I’ve attached the mnt log and volume info, but I couldn’t find anything 
> relevant in in those logs. I think this is because we run the VM’s with 
> libgfapi;
> [root@ovirt-host-01 ~]# engine-config  -g LibgfApiSupported
> LibgfApiSupported: true version: 4.2
> LibgfApiSupported: true version: 4.1
> LibgfApiSupported: true version: 4.3
> And I can confirm the qemu process is invoked with the gluster:// address for 
> the images.
> The message is logged in the /var/lib/libvirt/qemu/  file, which 
> I’ve also included. For a sample case see around; 2019-03-28 20:20:07
> Which has the error; E [MSGID: 133010] 
> [shard.c:2294:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on 
> shard 109886 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c 
> [Stale file handle]
>  
> 3. yes I see multiple instances for the same brick directory, like;
> /usr/sbin/glusterfsd -s 10.32.9.6 --volfile-id 
> ovirt-core.10.32.9.6.data-gfs-bricks-brick1-ovirt-core -p 
> /var/run/gluster/vols/ovirt-core/10.32.9.6-data-gfs-bricks-brick1-ovirt-core.pid
>  -S /var/run/gluster/452591c9165945d9.socket --brick-name 
> /data/gfs/bricks/brick1/ovirt-core -l 
> /var/log/glusterfs/bricks/data-gfs-bricks-brick1-ovirt-core.log 
> --xlator-option *-posix.glusterd-uuid=fb513da6-f3bd-4571-b8a2-db5efaf60cc1 
> --process-name brick --brick-port 49154 --xlator-option 
> ovirt-core-server.listen-port=49154
>  
> I’ve made an export of the output of ps from the time I observed these 
> multiple processes.
> In addition the brick_mux bug as noted by Atin. I might also have another 
> possible cause, as ovirt moves nodes from none-operational state or 
> maintenance state to active/activating, it also seems to restart gluster, 
> however I don’t have direct proof for this theory.
>  
> Thanks Olaf
> 
> On Fri 29 Mar 2019 at 10:03, Sandro Bonazzola wrote:
> 
> 
> On Thu 28 Mar 2019 at 17:48,  wrote:
> Dear All,
> 
> I wanted to share my experience upgrading from 4.2.8 to 4.3.1. While previous 
> upgrades from 4.1 to 4.2 etc. went rather smooth, this one was a different 
> experience. After first trying a test upgrade on a 3 node setup, which went 
> fine. i headed to upgrade the 9 node production platform, unaware of the 
> backward compatibility issues between gluster 3.12.15 -> 5.3. After upgrading 
> 2 nodes, the HA engine stopped and wouldn't start. Vdsm wasn't able to mount 
> the engine storage domain, since /dom_md/metadata was missing or couldn't be 
> accessed. Restoring this file by getting a good copy of the underlying 
> bricks, removing the file from the underlying bricks where the file was 0 
> bytes and mark with the 

[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-03-29 Thread Sandro Bonazzola
On Thu 28 Mar 2019 at 17:48,  wrote:

> Dear All,
>
> I wanted to share my experience upgrading from 4.2.8 to 4.3.1. While
> previous upgrades from 4.1 to 4.2 etc. went rather smooth, this one was a
> different experience. After first trying a test upgrade on a 3 node setup,
> which went fine. i headed to upgrade the 9 node production platform,
> unaware of the backward compatibility issues between gluster 3.12.15 ->
> 5.3. After upgrading 2 nodes, the HA engine stopped and wouldn't start.
> Vdsm wasn't able to mount the engine storage domain, since /dom_md/metadata
> was missing or couldn't be accessed. Restoring this file by getting a good
> copy of the underlying bricks, removing the file from the underlying bricks
> where the file was 0 bytes and mark with the stickybit, and the
> corresponding gfid's. Removing the file from the mount point, and copying
> back the file on the mount point. Manually mounting the engine domain,  and
> manually creating the corresponding symbolic links in /rhev/data-center and
> /var/run/vdsm/storage and fixing the ownership back to vdsm.kvm (which was
> root.root), i was able to start the HA engine again. Since the engine was
> up again, and things seemed rather unstable i decided to continue the
> upgrade on the other nodes suspecting an incompatibility in gluster
> versions, i thought would be best to have them all on the same version
> rather soonish. However things went from bad to worse, the engine stopped
> again, and all vm’s stopped working as well.  So on a machine outside the
> setup and restored a backup of the engine taken from version 4.2.8 just
> before the upgrade. With this engine I was at least able to start some vm’s
> again, and finalize the upgrade. Once the upgraded, things didn’t stabilize
> and also lose 2 vm’s during the process due to image corruption. After
> figuring out gluster 5.3 had quite some issues I was as lucky to see
> gluster 5.5 was about to be released, on the moment the RPM’s were
> available I’ve installed those. This helped a lot in terms of stability,
> for which I’m very grateful! However the performance is unfortunate
> terrible, it’s about 15% of what the performance was running gluster
> 3.12.15. It’s strange since a simple dd shows ok performance, but our
> actual workload doesn’t. While I would expect the performance to be better,
> due to all improvements made since gluster version 3.12. Does anybody share
> the same experience?
> I really hope gluster 6 will soon be tested with ovirt and released, and
> things start to perform and stabilize again..like the good old days. Of
> course when I can do anything, I’m happy to help.
>

Opened https://bugzilla.redhat.com/show_bug.cgi?id=1693998 to track the
rebase on Gluster 6.



>
> I think the following short list of issues we have after the migration;
> Gluster 5.5;
> -   Poor performance for our workload (mostly write dependent)
> -   VM’s randomly pause on unknown storage errors, which are “stale
> file’s”. corresponding log; Lookup on shard 797 failed. Base file gfid =
> 8a27b91a-ff02-42dc-bd4c-caa019424de8 [Stale file handle]
> -   Some files are listed twice in a directory (probably related the
> stale file issue?)
> Example;
> ls -la
> /rhev/data-center/59cd53a9-0003-02d7-00eb-01e3/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/4add6751-3731-4bbd-ae94-aaeed12ea450/
> total 3081
> drwxr-x---.  2 vdsm kvm4096 Mar 18 11:34 .
> drwxr-xr-x. 13 vdsm kvm4096 Mar 19 09:42 ..
> -rw-rw.  1 vdsm kvm 1048576 Mar 28 12:55
> 1a7cf259-6b29-421d-9688-b25dfaafb13c
> -rw-rw.  1 vdsm kvm 1048576 Mar 28 12:55
> 1a7cf259-6b29-421d-9688-b25dfaafb13c
> -rw-rw.  1 vdsm kvm 1048576 Jan 27  2018
> 1a7cf259-6b29-421d-9688-b25dfaafb13c.lease
> -rw-r--r--.  1 vdsm kvm 290 Jan 27  2018
> 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta
> -rw-r--r--.  1 vdsm kvm 290 Jan 27  2018
> 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta
>
> - brick processes sometimes starts multiple times. Sometimes I’ve 5 brick
> processes for a single volume. Killing all glusterfsd’s for the volume on
> the machine and running gluster v start  force usually just starts one
> after the event, from then on things look all right.
>
>
May I kindly ask to open bugs on Gluster for the above issues at
https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS ?
Sahina?


> Ovirt 4.3.2.1-1.el7
> -   All vms images ownership are changed to root.root after the vm is
> shutdown, probably related to;
> https://bugzilla.redhat.com/show_bug.cgi?id=1666795 but not only scoped
> to the HA engine. I’m still in compatibility mode 4.2 for the cluster and
> for the vm’s, but upgraded to version ovirt 4.3.2
>

Ryan?


> -   The network provider is set to ovn, which is fine..actually cool,
> only the “ovs-vswitchd” is a CPU hog, and utilizes 100%
>

Miguel? Dominik?


> -   It seems on all nodes vdsm tries to get the the stats for the HA
> engine, which is filling the logs with (not sure 

[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-03-29 Thread Krutika Dhananjay
Questions/comments inline ...

On Thu, Mar 28, 2019 at 10:18 PM  wrote:

> Dear All,
>
> I wanted to share my experience upgrading from 4.2.8 to 4.3.1. While
> previous upgrades from 4.1 to 4.2 etc. went rather smooth, this one was a
> different experience. After first trying a test upgrade on a 3 node setup,
> which went fine. i headed to upgrade the 9 node production platform,
> unaware of the backward compatibility issues between gluster 3.12.15 ->
> 5.3. After upgrading 2 nodes, the HA engine stopped and wouldn't start.
> Vdsm wasn't able to mount the engine storage domain, since /dom_md/metadata
> was missing or couldn't be accessed. Restoring this file by getting a good
> copy of the underlying bricks, removing the file from the underlying bricks
> where the file was 0 bytes and mark with the stickybit, and the
> corresponding gfid's. Removing the file from the mount point, and copying
> back the file on the mount point. Manually mounting the engine domain,  and
> manually creating the corresponding symbolic links in /rhev/data-center and
> /var/run/vdsm/storage and fixing the ownership back to vdsm.kvm (which was
> root.root), i was able to start the HA engine again. Since the engine was
> up again, and things seemed rather unstable i decided to continue the
> upgrade on the other nodes suspecting an incompatibility in gluster
> versions, i thought would be best to have them all on the same version
> rather soonish. However things went from bad to worse, the engine stopped
> again, and all vm’s stopped working as well.  So on a machine outside the
> setup and restored a backup of the engine taken from version 4.2.8 just
> before the upgrade. With this engine I was at least able to start some vm’s
> again, and finalize the upgrade. Once the upgraded, things didn’t stabilize
> and also lose 2 vm’s during the process due to image corruption. After
> figuring out gluster 5.3 had quite some issues I was as lucky to see
> gluster 5.5 was about to be released, on the moment the RPM’s were
> available I’ve installed those. This helped a lot in terms of stability,
> for which I’m very grateful! However the performance is unfortunate
> terrible, it’s about 15% of what the performance was running gluster
> 3.12.15. It’s strange since a simple dd shows ok performance, but our
> actual workload doesn’t. While I would expect the performance to be better,
> due to all improvements made since gluster version 3.12. Does anybody share
> the same experience?
> I really hope gluster 6 will soon be tested with ovirt and released, and
> things start to perform and stabilize again..like the good old days. Of
> course when I can do anything, I’m happy to help.
>
> I think the following short list of issues we have after the migration;
> Gluster 5.5;
> -   Poor performance for our workload (mostly write dependent)
>

For this, could you share the volume-profile output specifically for the
affected volume(s)? Here's what you need to do -

1. # gluster volume profile $VOLNAME stop
2. # gluster volume profile $VOLNAME start
3. Run the test inside the vm wherein you see bad performance
4. # gluster volume profile $VOLNAME info # save the output of this command
into a file
5. # gluster volume profile $VOLNAME stop
6. and attach the output file gotten in step 4

> -   VM’s randomly pause on unknown storage errors, which are “stale
> file’s”. corresponding log; Lookup on shard 797 failed. Base file gfid =
> 8a27b91a-ff02-42dc-bd4c-caa019424de8 [Stale file handle]
>

Could you share the complete gluster client log file (it would be a
filename matching the pattern rhev-data-center-mnt-glusterSD-*)
Also the output of `gluster volume info $VOLNAME`
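
For reference, the FUSE client log normally lives on the hypervisor under
/var/log/glusterfs/, named after the mount path, so something like this should
find it (a sketch; adjust the pattern to your actual mount points):

ls /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log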



> -   Some files are listed twice in a directory (probably related the
> stale file issue?)
> Example;
> ls -la
> /rhev/data-center/59cd53a9-0003-02d7-00eb-01e3/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/4add6751-3731-4bbd-ae94-aaeed12ea450/
> total 3081
> drwxr-x---.  2 vdsm kvm4096 Mar 18 11:34 .
> drwxr-xr-x. 13 vdsm kvm4096 Mar 19 09:42 ..
> -rw-rw.  1 vdsm kvm 1048576 Mar 28 12:55
> 1a7cf259-6b29-421d-9688-b25dfaafb13c
> -rw-rw.  1 vdsm kvm 1048576 Mar 28 12:55
> 1a7cf259-6b29-421d-9688-b25dfaafb13c
> -rw-rw.  1 vdsm kvm 1048576 Jan 27  2018
> 1a7cf259-6b29-421d-9688-b25dfaafb13c.lease
> -rw-r--r--.  1 vdsm kvm 290 Jan 27  2018
> 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta
> -rw-r--r--.  1 vdsm kvm 290 Jan 27  2018
> 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta
>

Adding DHT and readdir-ahead maintainers regarding entries getting listed
twice.
@Nithya Balachandran  ^^
@Gowdappa, Raghavendra  ^^
@Poornima Gurusiddaiah  ^^


>
> - brick processes sometimes starts multiple times. Sometimes I’ve 5 brick
> processes for a single volume. Killing all glusterfsd’s for the volume on
> the machine and running gluster v start  force usually just starts one
> after the event, from then on things look all right.
>

Did you mean 5 brick processes for 

[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-03-28 Thread Leo David
Olaf, thank you very much for this feedback, I was just about to upgrade my
12-node 4.2.8 production cluster, and it seems you have spared me a lot of
trouble.
Though, I thought that 4.3.1 comes with gluster 5.5, which has solved the
issues, and that the upgrade procedure works seamlessly.
Not sure now how long, or for what oVirt version, to wait before upgrading my
cluster...

On Thu, Mar 28, 2019, 18:48  wrote:

> Dear All,
>
> I wanted to share my experience upgrading from 4.2.8 to 4.3.1. While
> previous upgrades from 4.1 to 4.2 etc. went rather smooth, this one was a
> different experience. After first trying a test upgrade on a 3 node setup,
> which went fine. i headed to upgrade the 9 node production platform,
> unaware of the backward compatibility issues between gluster 3.12.15 ->
> 5.3. After upgrading 2 nodes, the HA engine stopped and wouldn't start.
> Vdsm wasn't able to mount the engine storage domain, since /dom_md/metadata
> was missing or couldn't be accessed. Restoring this file by getting a good
> copy of the underlying bricks, removing the file from the underlying bricks
> where the file was 0 bytes and mark with the stickybit, and the
> corresponding gfid's. Removing the file from the mount point, and copying
> back the file on the mount point. Manually mounting the engine domain,  and
> manually creating the corresponding symbolic links in /rhev/data-center and
> /var/run/vdsm/storage and fixing the ownership back to vdsm.kvm (which was
> root.root), i was able to start the HA engine again. Since the engine was
> up again, and things seemed rather unstable i decided to continue the
> upgrade on the other nodes suspecting an incompatibility in gluster
> versions, i thought would be best to have them all on the same version
> rather soonish. However things went from bad to worse, the engine stopped
> again, and all vm’s stopped working as well.  So on a machine outside the
> setup and restored a backup of the engine taken from version 4.2.8 just
> before the upgrade. With this engine I was at least able to start some vm’s
> again, and finalize the upgrade. Once the upgraded, things didn’t stabilize
> and also lose 2 vm’s during the process due to image corruption. After
> figuring out gluster 5.3 had quite some issues I was as lucky to see
> gluster 5.5 was about to be released, on the moment the RPM’s were
> available I’ve installed those. This helped a lot in terms of stability,
> for which I’m very grateful! However the performance is unfortunate
> terrible, it’s about 15% of what the performance was running gluster
> 3.12.15. It’s strange since a simple dd shows ok performance, but our
> actual workload doesn’t. While I would expect the performance to be better,
> due to all improvements made since gluster version 3.12. Does anybody share
> the same experience?
> I really hope gluster 6 will soon be tested with ovirt and released, and
> things start to perform and stabilize again..like the good old days. Of
> course when I can do anything, I’m happy to help.
>
> I think the following short list of issues we have after the migration;
> Gluster 5.5;
> -   Poor performance for our workload (mostly write dependent)
> -   VM’s randomly pause on unknown storage errors, which are “stale
> file’s”. corresponding log; Lookup on shard 797 failed. Base file gfid =
> 8a27b91a-ff02-42dc-bd4c-caa019424de8 [Stale file handle]
> -   Some files are listed twice in a directory (probably related the
> stale file issue?)
> Example;
> ls -la
> /rhev/data-center/59cd53a9-0003-02d7-00eb-01e3/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/4add6751-3731-4bbd-ae94-aaeed12ea450/
> total 3081
> drwxr-x---.  2 vdsm kvm4096 Mar 18 11:34 .
> drwxr-xr-x. 13 vdsm kvm4096 Mar 19 09:42 ..
> -rw-rw.  1 vdsm kvm 1048576 Mar 28 12:55
> 1a7cf259-6b29-421d-9688-b25dfaafb13c
> -rw-rw.  1 vdsm kvm 1048576 Mar 28 12:55
> 1a7cf259-6b29-421d-9688-b25dfaafb13c
> -rw-rw.  1 vdsm kvm 1048576 Jan 27  2018
> 1a7cf259-6b29-421d-9688-b25dfaafb13c.lease
> -rw-r--r--.  1 vdsm kvm 290 Jan 27  2018
> 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta
> -rw-r--r--.  1 vdsm kvm 290 Jan 27  2018
> 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta
>
> - brick processes sometimes starts multiple times. Sometimes I’ve 5 brick
> processes for a single volume. Killing all glusterfsd’s for the volume on
> the machine and running gluster v start  force usually just starts one
> after the event, from then on things look all right.
>
> Ovirt 4.3.2.1-1.el7
> -   All vms images ownership are changed to root.root after the vm is
> shutdown, probably related to;
> https://bugzilla.redhat.com/show_bug.cgi?id=1666795 but not only scoped
> to the HA engine. I’m still in compatibility mode 4.2 for the cluster and
> for the vm’s, but upgraded to version ovirt 4.3.2
> -   The network provider is set to ovn, which is fine..actually cool,
> only the “ovs-vswitchd” is a CPU hog, and utilizes 100%
> -   It seems on all 

[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-03-28 Thread olaf . buitelaar
Forgot one more issue with ovirt: on some hypervisor nodes we also run docker, 
and it appears vdsm tries to get hold of the interfaces docker creates/removes, 
which is spamming the vdsm and engine logs with;
Get Host Statistics failed: Internal JSON-RPC error: {'reason': '[Errno 19] 
veth7611c53 is not present in the system'}
Couldn’t really find a way to let vdsm ignore those interfaces.
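
One thing that might be worth trying (untested here, and the option name is an
assumption on my side - check the config defaults shipped with your vdsm
version): vdsm appears to read a hidden_nics pattern list from
/etc/vdsm/vdsm.conf, so hiding the docker veth devices could look roughly like:

[vars]
# hypothetical example: make vdsm skip docker's veth* devices; keep/prepend
# whatever patterns your vdsm version already hides by default
hidden_nics = veth*

followed by a restart of vdsmd (systemctl restart vdsmd).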


[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-03-28 Thread olaf . buitelaar
Dear All,

I wanted to share my experience upgrading from 4.2.8 to 4.3.1. While previous 
upgrades from 4.1 to 4.2 etc. went rather smoothly, this one was a different 
experience. After first trying a test upgrade on a 3-node setup, which went 
fine, I headed to upgrade the 9-node production platform, unaware of the 
backward compatibility issues between gluster 3.12.15 -> 5.3. After upgrading 2 
nodes, the HA engine stopped and wouldn't start. Vdsm wasn't able to mount the 
engine storage domain, since /dom_md/metadata was missing or couldn't be 
accessed. I restored this file by getting a good copy from the underlying 
bricks, removing the file from the underlying bricks where it was 0 bytes and 
marked with the sticky bit (and the corresponding gfid's), removing the file 
from the mount point, and copying the file back onto the mount point. After 
manually mounting the engine domain, manually creating the corresponding 
symbolic links in /rhev/data-center and /var/run/vdsm/storage, and fixing the 
ownership back to vdsm.kvm (which was root.root), I was able to start the HA 
engine again.

Since the engine was up again, and things seemed rather unstable, I decided to 
continue the upgrade on the other nodes; suspecting an incompatibility in 
gluster versions, I thought it would be best to have them all on the same 
version rather soon. However things went from bad to worse, the engine stopped 
again, and all vm’s stopped working as well. So on a machine outside the setup 
I restored a backup of the engine taken from version 4.2.8 just before the 
upgrade. With this engine I was at least able to start some vm’s again, and 
finalize the upgrade. Once upgraded, things didn’t stabilize, and I also lost 2 
vm’s during the process due to image corruption.

After figuring out that gluster 5.3 had quite some issues, I was lucky to see 
gluster 5.5 was about to be released; the moment the RPM’s were available I 
installed those. This helped a lot in terms of stability, for which I’m very 
grateful! However the performance is unfortunately terrible; it’s about 15% of 
what it was running gluster 3.12.15. It’s strange, since a simple dd shows ok 
performance but our actual workload doesn’t, while I would expect the 
performance to be better due to all the improvements made since gluster 
version 3.12. Does anybody share the same experience?
I really hope gluster 6 will soon be tested with ovirt and released, and things 
start to perform and stabilize again... like the good old days. Of course if I 
can do anything, I’m happy to help.

I think the following is a short list of the issues we have after the migration:
Gluster 5.5:
-   Poor performance for our workload (mostly write dependent)
-   VM’s randomly pause on unknown storage errors, which are “stale 
files”. Corresponding log: Lookup on shard 797 failed. Base file gfid = 
8a27b91a-ff02-42dc-bd4c-caa019424de8 [Stale file handle]
-   Some files are listed twice in a directory (probably related to the stale 
file issue?)
Example;
ls -la  
/rhev/data-center/59cd53a9-0003-02d7-00eb-01e3/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/4add6751-3731-4bbd-ae94-aaeed12ea450/
total 3081
drwxr-x---.  2 vdsm kvm4096 Mar 18 11:34 .
drwxr-xr-x. 13 vdsm kvm4096 Mar 19 09:42 ..
-rw-rw.  1 vdsm kvm 1048576 Mar 28 12:55 
1a7cf259-6b29-421d-9688-b25dfaafb13c
-rw-rw.  1 vdsm kvm 1048576 Mar 28 12:55 
1a7cf259-6b29-421d-9688-b25dfaafb13c
-rw-rw.  1 vdsm kvm 1048576 Jan 27  2018 
1a7cf259-6b29-421d-9688-b25dfaafb13c.lease
-rw-r--r--.  1 vdsm kvm 290 Jan 27  2018 
1a7cf259-6b29-421d-9688-b25dfaafb13c.meta
-rw-r--r--.  1 vdsm kvm 290 Jan 27  2018 
1a7cf259-6b29-421d-9688-b25dfaafb13c.meta

- Brick processes sometimes start multiple times. Sometimes I’ve got 5 brick 
processes for a single volume. Killing all glusterfsd’s for the volume on the 
machine and running 'gluster v start <volname> force' usually just starts one 
after the event; from then on things look all right.

Ovirt 4.3.2.1-1.el7
-   All vm images' ownership is changed to root.root after the vm is 
shut down, probably related to 
https://bugzilla.redhat.com/show_bug.cgi?id=1666795 but not only scoped to the 
HA engine. I’m still in compatibility mode 4.2 for the cluster and for the 
vm’s, but upgraded to ovirt version 4.3.2
-   The network provider is set to ovn, which is fine... actually cool, only 
the “ovs-vswitchd” is a CPU hog, and utilizes 100%
-   It seems that on all nodes vdsm tries to get the stats for the HA 
engine, which is filling the logs with (not sure if this is new);
[api.virt] FINISH getStats return={'status': {'message': "Virtual machine does 
not exist: {'vmId': u'20d69acd-edfd-4aeb-a2ae-49e9c121b7e9'}", 'code': 1}} 
from=::1,59290, vmId=20d69acd-edfd-4aeb-a2ae-49e9c121b7e9 (api:54)
-   It seems the package os_brick [root] managedvolume not supported: 
Managed Volume Not Supported. Missing package os-brick.: ('Cannot import 
os_brick',) (caps:149)  which 

[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-03-26 Thread Darrell Budic
Following up on this, my test/dev cluster is now completely upgraded to ovirt 
4.3.2-1 and gluster 5.5, and I’ve bumped the op-version on the gluster volumes. 
It’s behaving normally and gluster is happy, with no excessive healing or 
crashing bricks.
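
For anyone else doing this, bumping the op-version is quick once every peer is 
on the new version (a sketch; use the number your own cluster reports rather 
than a hard-coded value):

# highest op-version the installed binaries support
gluster volume get all cluster.max-op-version
# op-version the cluster is currently running at
gluster volume get all cluster.op-version
# raise it to the value reported by max-op-version
gluster volume set all cluster.op-version <value-from-max-op-version>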

I did encounter https://bugzilla.redhat.com/show_bug.cgi?id=1677160 on my 
production cluster (with gluster 5.5 clients and 3.12.15 servers) and am 
proceeding to upgrade my gluster servers to 5.5 now that I’m happy with it on 
my dev cluster. A little quicker than I’d like, but it seems to be behaving, 
and I was also in the middle of adding disk to my servers and have to restart 
them (or at least gluster), so I’m going for it.

After I finish this, I’ll test gluster 6 out.

  -Darrell



> On Mar 25, 2019, at 11:04 AM, Darrell Budic  wrote:
> 
> I’m not quite done with my test upgrade to ovirt 4.3.x with gluster 5.5, but 
> so far it’s looking good. I have NOT encountered the upgrade bugs listed as 
> resolved in the 5.5 release notes. Strahil, I didn’t encounter the brick 
> death issue and don’t have a bug ID handy for it, but so far I haven’t had 
> any bricks die. I’m moving the last node of my hyperconverged test 
> environment over today, and will followup again tomorrow on it.
> 
> Separately, I upgraded my production nodes from ovirt 4.3.1 to 4.3.2 (they 
> have a separate gluster server cluster which is still on 3.12.15), which 
> seems to have moved to the gluster 5.3.2 release. While 5.3.0 clients were 
> not having any trouble talking to my 3.12.15 servers, 5.3.2 hit 
> https://bugzilla.redhat.com/show_bug.cgi?id=1651246 
> , causing disconnects to 
> one of my servers (but only one, oddly enough), raising the load on my other 
> two servers and causing a lot of continuous healing. This lead to some 
> stability issues with my hosted engine and general sluggishness of the ovirt 
> UI. I also experienced problems migrating from 4.3.1 nodes, but that seems to 
> have been related to the underlying gluster issues, as it seems to have 
> cleared up onceI resolved the gluster problems. Since I was testing gluster 
> 5.5 already, I moved my nodes to gluster 5.5 (instead of rolling them back) 
> as the bug above was resolved in that version. That did the trick, and my 
> cluster is back to normal and behaving properly again.
> 
> So my gluster 5.5 experience has been positive so far, and it looks like 5.3 
> is a version for laying down and avoiding. I’ll update again tomorrow, and 
> then flag the centos maintainers about 5.5 stability so it gets out of the 
> -testing repo if all continues to go well.
> 
>   -Darrell
> 
> 
>> On Mar 21, 2019, at 3:39 PM, Strahil > > wrote:
>> 
>> Hi Darrel,
>> 
>> Will it fix the cluster brick sudden death issue ?
>> 
>> Best Regards,
>> Strahil Nikolov
>> 
>> On Mar 21, 2019 21:56, Darrell Budic > > wrote:
>> This release of Gluster 5.5 appears to fix the gluster 3.12->5.3 migration 
>> problems many ovirt users have encountered. 
>> 
>> I’ll try and test it out this weekend and report back. If anyone else gets a 
>> chance to check it out, let us know how it goes!
>> 
>>   -Darrell
>> 
>> Begin forwarded message:
>> 
>> From: Shyam Ranganathan mailto:srang...@redhat.com>>
>> Subject: [Gluster-users] Announcing Gluster release 5.5
>> Date: March 21, 2019 at 6:06:33 AM CDT
>> To: annou...@gluster.org , gluster-users 
>> Discussion List > >
>> Cc: GlusterFS Maintainers > >
>> 
>> The Gluster community is pleased to announce the release of Gluster
>> 5.5 (packages available at [1]).
>> 
>> Release notes for the release can be found at [3].
>> 
>> Major changes, features and limitations addressed in this release:
>> 
>> - Release 5.4 introduced an incompatible change that prevented rolling
>> upgrades, and hence was never announced to the lists. As a result we are
>> jumping a release version and going to 5.5 from 5.3, that does not have
>> the problem.
>> 
>> Thanks,
>> Gluster community
>> 
>> [1] Packages for 5.5:
>> https://download.gluster.org/pub/gluster/glusterfs/5/5.5/ 
>> 
>> 
>> [2] Release notes for 5.5:
>> https://docs.gluster.org/en/latest/release-notes/5.5/
>> ___
>> Gluster-users mailing list
>> gluster-us...@gluster.org 
>> https://lists.gluster.org/mailman/listinfo/gluster-users 
>> 
>> 
> 
> ___
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> 

[ovirt-users] Re: [Gluster-users] Announcing Gluster release 5.5

2019-03-25 Thread Darrell Budic
I’m not quite done with my test upgrade to ovirt 4.3.x with gluster 5.5, but so 
far it’s looking good. I have NOT encountered the upgrade bugs listed as 
resolved in the 5.5 release notes. Strahil, I didn’t encounter the brick death 
issue and don’t have a bug ID handy for it, but so far I haven’t had any bricks 
die. I’m moving the last node of my hyperconverged test environment over today, 
and will followup again tomorrow on it.

Separately, I upgraded my production nodes from ovirt 4.3.1 to 4.3.2 (they have 
a separate gluster server cluster which is still on 3.12.15), which seems to 
have moved to the gluster 5.3.2 release. While 5.3.0 clients were not having 
any trouble talking to my 3.12.15 servers, 5.3.2 hit 
https://bugzilla.redhat.com/show_bug.cgi?id=1651246, causing disconnects to 
one of my servers (but only one, oddly enough), raising the load on my other 
two servers and causing a lot of continuous healing. This led to some 
stability issues with my hosted engine and general sluggishness of the ovirt 
UI. I also experienced problems migrating from 4.3.1 nodes, but that seems to 
have been related to the underlying gluster issues, as it seems to have cleared 
up once I resolved the gluster problems. Since I was testing gluster 5.5 
already, I moved my nodes to gluster 5.5 (instead of rolling them back) as the 
bug above was resolved in that version. That did the trick, and my cluster is 
back to normal and behaving properly again.

So my gluster 5.5 experience has been positive so far, and it looks like 5.3 is 
a version for laying down and avoiding. I’ll update again tomorrow, and then 
flag the centos maintainers about 5.5 stability so it gets out of the -testing 
repo if all continues to go well.

  -Darrell


> On Mar 21, 2019, at 3:39 PM, Strahil  wrote:
> 
> Hi Darrel,
> 
> Will it fix the cluster brick sudden death issue ?
> 
> Best Regards,
> Strahil Nikolov
> 
> On Mar 21, 2019 21:56, Darrell Budic  wrote:
> This release of Gluster 5.5 appears to fix the gluster 3.12->5.3 migration 
> problems many ovirt users have encountered. 
> 
> I’ll try and test it out this weekend and report back. If anyone else gets a 
> chance to check it out, let us know how it goes!
> 
>   -Darrell
> 
> Begin forwarded message:
> 
> From: Shyam Ranganathan <srang...@redhat.com>
> Subject: [Gluster-users] Announcing Gluster release 5.5
> Date: March 21, 2019 at 6:06:33 AM CDT
> To: annou...@gluster.org, gluster-users Discussion List <gluster-us...@gluster.org>
> Cc: GlusterFS Maintainers
> 
> The Gluster community is pleased to announce the release of Gluster
> 5.5 (packages available at [1]).
> 
> Release notes for the release can be found at [2].
> 
> Major changes, features and limitations addressed in this release:
> 
> - Release 5.4 introduced an incompatible change that prevented rolling
> upgrades, and hence was never announced to the lists. As a result we are
> jumping a release version and going to 5.5 from 5.3, that does not have
> the problem.
> 
> Thanks,
> Gluster community
> 
> [1] Packages for 5.5:
> https://download.gluster.org/pub/gluster/glusterfs/5/5.5/ 
> 
> 
> [2] Release notes for 5.5:
> https://docs.gluster.org/en/latest/release-notes/5.5/
> ___
> Gluster-users mailing list
> gluster-us...@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users 
> 
> 
