Dear Mohit,

I've upgraded to gluster 5.6, however the starting of multiple glusterfsd 
processes per brick doesn't seem to be fully resolved yet, although it does 
seem to happen less often than before. In some cases glusterd did seem to 
detect that a glusterfsd was already running, but decided it was not valid. 
The issue was reproducible on all my machines after a reboot, but only a few 
bricks were affected: I'm running about 14 bricks per machine, and only 1 - 3 
were affected, with the ones holding 3 full bricks seeming to suffer most. In 
some cases a restart of the glusterd service also spawned multiple glusterfsd 
processes for the same bricks configured on the node.
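
For reference, this is roughly how I spot the duplicates (an example only, 
using one of my brick paths):

  # list all glusterfsd processes serving this brick path;
  # more than one line means duplicate brick processes
  ps -ef | grep '[g]lusterfsd' | grep '/data/gfs/bricks/brick1/ovirt-core'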

See for example these logs:
[2019-04-19 17:49:50.853099] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 17:50:33.302239] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 17:56:11.287692] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 17:57:12.699967] I [glusterd-utils.c:6184:glusterd_brick_start] 
0-management: Either pid 14884 is not running or brick path 
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-19 17:57:12.700150] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 18:02:58.420870] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 18:03:29.420891] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 18:48:14.046029] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 18:55:04.508606] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core

or

[2019-04-18 17:00:00.665476] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:00:32.799529] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:02:38.271880] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:08:32.867046] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:09:00.440336] I [glusterd-utils.c:6184:glusterd_brick_start] 
0-management: Either pid 9278 is not running or brick path 
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-18 17:09:00.440476] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:09:07.644070] I [glusterd-utils.c:6184:glusterd_brick_start] 
0-management: Either pid 24126 is not running or brick path 
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-18 17:09:07.644184] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:09:13.785798] I [glusterd-utils.c:6184:glusterd_brick_start] 
0-management: Either pid 27197 is not running or brick path 
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-18 17:09:13.785918] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:09:24.344561] I [glusterd-utils.c:6184:glusterd_brick_start] 
0-management: Either pid 28468 is not running or brick path 
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-18 17:09:24.344675] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:37:07.150799] I [glusterd-utils.c:6214:glusterd_brick_start] 
0-management: discovered already-running brick 
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 18:17:23.203719] I [glusterd-utils.c:6301:glusterd_brick_start] 
0-management: starting a fresh brick process for brick 
/data/gfs/bricks/brick1/ovirt-core

Again, the procedure to resolve this was to kill all the glusterfsd processes 
for the brick and run gluster v start <VOL> force, which resulted in only one 
process being started.
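
In rough steps, the recovery per affected brick looks like this (a sketch; 
<PID> and <VOL> are placeholders):

  # find every glusterfsd instance serving the affected brick
  pgrep -af glusterfsd | grep ovirt-core
  # kill all of them
  kill <PID> <PID>
  # let glusterd spawn a single fresh brick process
  gluster volume start <VOL> force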

After the upgrade to 5.6 I do notice a small performance improvement of around 
15%, but it's still far from 3.12.15. I don't experience a drop in network 
utilisation, but I doubt I ever suffered from that issue; for as long as I've 
run gluster (since 3.7), the usage has always averaged between 15 and 180 Mbps, 
and depending on the machine and the hosted bricks/brick types it gravitates 
around 30 Mbps, 80 Mbps or 160 Mbps.

I also found the reason ovs-vswitchd starts using 100% CPU: it appears one of 
the machines tries to add an interface twice on all other machines. I don't 
really understand where this is configured; this is the ovs-vsctl show output 
on one of the affected machines:
801cc877-dd59-4b73-9cd4-6e89b7dd4245
    Bridge br-int
        fail_mode: secure
        Port "ovn-ab29e1-0"
            Interface "ovn-ab29e1-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.32.9.5"}
        Port "ovn-e1f5eb-0"
            Interface "ovn-e1f5eb-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.32.9.7"}
        Port "ovn-17c441-0"
            Interface "ovn-17c441-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.32.9.21"}
        Port "ovn-6a362b-0"
            Interface "ovn-6a362b-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.32.9.5"}
                error: "could not add network device ovn-6a362b-0 to ofproto 
(File exists)"
        Port "ovn-99caac-0"
            Interface "ovn-99caac-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.32.9.20"}
        Port "ovn-1c9643-0"
            Interface "ovn-1c9643-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.32.9.6"}
        Port "ovn-2e5821-0"
            Interface "ovn-2e5821-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.32.9.8"}
        Port "ovn-484b7e-0"
            Interface "ovn-484b7e-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.32.9.9"}
        Port br-int
            Interface br-int
                type: internal
        Port "ovn-0522c9-0"
            Interface "ovn-0522c9-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.32.9.4"}
        Port "ovn-437985-0"
            Interface "ovn-437985-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.0.6.1"}
    ovs_version: "2.10.1"     

It seems the interface for 10.32.9.5 is added twice: ovn-6a362b-0 and 
ovn-ab29e1-0. Manually removing the interface with ovs-vsctl doesn't help (see 
the commands below). The only thing that seems to resolve it is restarting the 
openvswitch service on 10.32.9.5, but when I reboot the machine the issue 
resurfaces. Any pointers on where this might be configured are welcome.
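
For completeness, this is roughly what I tried (a sketch; the exact service 
name may differ per distribution):

  # remove the duplicate geneve port by hand -- it did not help
  ovs-vsctl del-port br-int ovn-6a362b-0
  # restarting openvswitch on 10.32.9.5 clears it, but the issue comes back after a reboot
  systemctl restart openvswitch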


Also, I found that glusterd is always restarted when a node transitions from 
maintenance/non-operational to active. Especially when the node is 
non-operational and other nodes are also non-operational, this introduces 
extra instability, since the gluster service is constantly restarting, causing 
quorum loss and making things worse. Maybe it's an idea to have some logic in 
place for when gluster should be restarted by oVirt and when it's better to 
leave it running?

I was also thinking it might be a good idea to have an option for what should 
happen when a disk image becomes unavailable; currently you have the option to 
either pause the VM or kill it. Maybe a third option could be added that treats 
this event as a removed/faulty disk. In that scenario you could, for example, 
set up a mirrored volume within the VM on two different gluster volumes and let 
your VM continue running.
 
I've also upgraded to oVirt 4.3.3, and the messages about "Get Host Statistics 
failed: Internal JSON-RPC error: {'reason': '[Errno 19] veth7611c53 is not 
present in the system'}" seem to be gone, but I cannot find a specific release 
note about it.

Hope we can also resolve the other issues.

Best,
Olaf