Dear Mohit,
I've upgraded to gluster 5.6; however, the starting of multiple glusterfsd
processes per brick doesn't seem to be fully resolved yet, although it does
seem to happen less often than before. Also, in some cases glusterd did detect
that a glusterfsd was running, but decided it was not valid. It was reproducible on
all my machines after a reboot, but only a few bricks seemed to be affected.
I'm running about 14 bricks per machine, and only 1 - 3 were affected. The ones
with 3 full bricks seemed to suffer most. Also, in some cases a restart of the
glusterd service spawned multiple glusterfsd processes for the same bricks
configured on the node.
See for example these logs:
[2019-04-19 17:49:50.853099] I [glusterd-utils.c:6214:glusterd_brick_start]
0-management: discovered already-running brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 17:50:33.302239] I [glusterd-utils.c:6214:glusterd_brick_start]
0-management: discovered already-running brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 17:56:11.287692] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 17:57:12.699967] I [glusterd-utils.c:6184:glusterd_brick_start]
0-management: Either pid 14884 is not running or brick path
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-19 17:57:12.700150] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 18:02:58.420870] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 18:03:29.420891] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 18:48:14.046029] I [glusterd-utils.c:6214:glusterd_brick_start]
0-management: discovered already-running brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-19 18:55:04.508606] I [glusterd-utils.c:6214:glusterd_brick_start]
0-management: discovered already-running brick
/data/gfs/bricks/brick1/ovirt-core
or
[2019-04-18 17:00:00.665476] I [glusterd-utils.c:6214:glusterd_brick_start]
0-management: discovered already-running brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:00:32.799529] I [glusterd-utils.c:6214:glusterd_brick_start]
0-management: discovered already-running brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:02:38.271880] I [glusterd-utils.c:6214:glusterd_brick_start]
0-management: discovered already-running brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:08:32.867046] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:09:00.440336] I [glusterd-utils.c:6184:glusterd_brick_start]
0-management: Either pid 9278 is not running or brick path
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-18 17:09:00.440476] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:09:07.644070] I [glusterd-utils.c:6184:glusterd_brick_start]
0-management: Either pid 24126 is not running or brick path
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-18 17:09:07.644184] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:09:13.785798] I [glusterd-utils.c:6184:glusterd_brick_start]
0-management: Either pid 27197 is not running or brick path
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-18 17:09:13.785918] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:09:24.344561] I [glusterd-utils.c:6184:glusterd_brick_start]
0-management: Either pid 28468 is not running or brick path
/data/gfs/bricks/brick1/ovirt-core is not consumed so cleanup pidfile
[2019-04-18 17:09:24.344675] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 17:37:07.150799] I [glusterd-utils.c:6214:glusterd_brick_start]
0-management: discovered already-running brick
/data/gfs/bricks/brick1/ovirt-core
[2019-04-18 18:17:23.203719] I [glusterd-utils.c:6301:glusterd_brick_start]
0-management: starting a fresh brick process for brick
/data/gfs/bricks/brick1/ovirt-core
Again, the procedure to resolve this was to kill all the glusterfsd processes
for the brick and run gluster volume start <VOL> force, which resulted in only
one process being started.
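For reference, the workaround boils down to something like this; the brick path
pattern is taken from the logs above, and <VOL> stays a placeholder for the
affected volume name:

```shell
# List the (possibly duplicate) glusterfsd processes serving the brick:
pgrep -af 'glusterfsd.*ovirt-core'

# Kill all brick processes for that brick...
pkill -f 'glusterfsd.*ovirt-core'

# ...and let glusterd spawn a single fresh one:
gluster volume start <VOL> force
```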
After the upgrade to 5.6 I do notice a small performance improvement of around
15%, but it's still far from 3.12.15. I don't experience a drop in network
utilisation, but I doubt I ever suffered from that issue: for as long as I've
run gluster (since 3.7), the usage has always averaged between 15 and 180 Mbps
and, depending on the machine and the hosted bricks/brick types, gravitates
around 30/80/160 Mbps.
I also found the reason ovs-vswitchd starts using 100% CPU: it appears one of
the machines tries to add an interface twice on all other machines. I don't
really understand where this is configured:
801cc877-dd59-4b73-9cd4-6e89b7dd4245
Bridge br-int
fail_mode: secure
Port "ovn-ab29e1-0"
Interface "ovn-ab29e1-0"
type: geneve
options: {csum="true", key=flow, remote_ip="10.32.9.5"}
Port "ovn-e1f5eb-0"
Interface "ovn-e1f5eb-0"
type: geneve
options: {csum="true", key=flow, remote_ip="10.32.9.7"}
Port "ovn-17c441-0"
Interface "ovn-17c441-0"
type: geneve
options: {csum="true", key=flow, remote_ip="10.32.9.21"}
Port "ovn-6a362b-0"
Interface "ovn-6a362b-0"
type: geneve
options: {csum="true", key=flow, remote_ip="10.32.9.5"}
error: "could not add network device ovn-6a362b-0 to ofproto
(File exists)"
Port "ovn-99caac-0"
Interface "ovn-99caac-0"
type: geneve
options: {csum="true", key=flow, remote_ip="10.32.9.20"}
Port "ovn-1c9643-0"
Interface "ovn-1c9643-0"
type: geneve
options: {csum="true", key=flow, remote_ip="10.32.9.6"}
Port "ovn-2e5821-0"
Interface "ovn-2e5821-0"
type: geneve
options: {csum="true", key=flow, remote_ip="10.32.9.8"}
Port "ovn-484b7e-0"
Interface "ovn-484b7e-0"
type: geneve
options: {csum="true", key=flow, remote_ip="10.32.9.9"}
Port br-int
Interface br-int
type: internal
Port "ovn-0522c9-0"
Interface "ovn-0522c9-0"
type: geneve
options: {csum="true", key=flow, remote_ip="10.32.9.4"}
Port "ovn-437985-0"
Interface "ovn-437985-0"
type: geneve
options: {csum="true", key=flow, remote_ip="10.0.6.1"}
ovs_version: "2.10.1"
It seems the interface for 10.32.9.5 is added twice: ovn-6a362b-0 and
ovn-ab29e1-0.
Manually removing the interface with ovs-vsctl doesn't help. The only thing
which seems to resolve it is restarting the openvswitch service on 10.32.9.5;
however, when I reboot the machine the issue resurfaces.
Any pointers on where this might be configured are welcome.
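For completeness, the manual removal I attempted was along these lines (port
name taken from the dump above); since ovn-controller recreates tunnel ports
from the southbound Chassis table, I also plan to check on the OVN central node
whether 10.32.9.5 is registered twice there:

```shell
# List the OVN tunnel ports on the integration bridge:
ovs-vsctl list-ports br-int

# Remove the duplicate port (this is what I tried; it did not stick):
ovs-vsctl del-port br-int ovn-6a362b-0

# On the OVN central node: each chassis should appear only once,
# with a single geneve encap IP.
ovn-sbctl show
```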
Also, I found that glusterd is always restarted when a node transitions from
maintenance/non-operational to active. Especially when the node is
non-operational and other nodes are also non-operational, this introduces
extra instability, since the gluster service is constantly restarting, causing
quorum loss and making things worse. Maybe it's an idea to have some logic in
place for when gluster should be restarted by oVirt and when it's better to
leave it running?
I was also thinking it might be a good idea to have an option for what should
happen when a disk image becomes unavailable; currently you have the option to
either pause the VM or kill it. Maybe a third option could be added that
treats this event as a removed/faulty disk. In that scenario you could, for
example, set up a mirrored volume within the VM on 2 different gluster
volumes, and let your VM continue running.
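The in-guest mirroring I have in mind would look roughly like this; the device
names /dev/vdb and /dev/vdc are hypothetical, each backed by a disk image on a
different gluster volume:

```shell
# Build a RAID1 mirror across the two virtual disks:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/vdb /dev/vdc

# Put a filesystem on the mirror and mount it:
mkfs.xfs /dev/md0
mount /dev/md0 /data

# If one gluster volume disappears and oVirt surfaces that as a failed
# disk, md keeps running degraded on the remaining leg:
mdadm --detail /dev/md0
```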
I've also upgraded to oVirt 4.3.3, and the messages like "Get Host Statistics
failed: Internal JSON-RPC error: {'reason': '[Errno 19]
veth7611c53 is not present in the system'}" seem to be gone, but I cannot find
a specific release note about it.
Hope we can also resolve the other issues.
Best Olaf
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct:
https://www.ovirt.org/community/about/community-guidelines/
List Archives:
https://lists.ovirt.org/archives/list/[email protected]/message/ELTNB4IMCFWBKNZVPE7D7S4GLSEVSZSV/