I have observed this behaviour recently, and in the past on 4.3 and 4.4, and in my case it's almost always following an oVirt upgrade. After an upgrade (especially one involving GlusterFS) I'd have bricks randomly go down like you're describing for about a week or so, and I'd have to start them manually. At some point it just corrects itself and is stable again. I really have no idea why it occurs, or what eventually stops it from happening.
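In case it helps while it's still happening: a rough helper I use after upgrades to see which brick logs recorded health-check kills. The helper name and the sample data are mine; on a real host you'd point it at /var/log/glusterfs/bricks/*.log instead of the scratch file.

```shell
# Count "health-check failed" kills per brick log (sketch, not a tool).
count_health_check_kills() {
    for f in "$@"; do
        printf '%s: %s\n' "$f" "$(grep -c 'health-check failed, going down' "$f")"
    done
}

# Demo against a scratch file holding two log lines copied from this thread:
log=$(mktemp)
cat > "$log" <<'EOF'
[2021-07-07 07:13:37.408184] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
[2021-07-07 16:11:14.518971] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
EOF
count_health_check_kills "$log"
rm -f "$log"
```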
On Wed, Jul 7, 2021 at 4:10 PM Jiří Sléžka <jiri.sle...@slu.cz> wrote:
> Hello,
>
> I have a 3-node HCI cluster with oVirt 4.4.6 and CentOS 8.
>
> From time to time a (I believe) random brick on a random host goes down
> because of the health-check. It looks like:
>
> [root@ovirt-hci02 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408184] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408407] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.518971] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.519200] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM
>
> On another host:
>
> [root@ovirt-hci01 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.983327] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix: health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.983728] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix: still alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.769129] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.769819] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM
>
> I cannot link these errors to any storage/fs issue (in dmesg or /var/log/messages); the brick devices look healthy (smartd).
>
> I can force-start the brick with
>
> gluster volume start vms|engine force
>
> and after some healing everything works fine for a few days.
>
> Did anybody else observe this behavior?
>
> The vms volume has this structure (two bricks per host, each a separate JBOD SSD disk); the engine volume has one brick on each host...
>
> gluster volume info vms
>
> Volume Name: vms
> Type: Distributed-Replicate
> Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 3 = 6
> Transport-type: tcp
> Bricks:
> Brick1: 10.0.4.11:/gluster_bricks/vms/vms
> Brick2: 10.0.4.13:/gluster_bricks/vms/vms
> Brick3: 10.0.4.12:/gluster_bricks/vms/vms
> Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
> Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
> Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
> Options Reconfigured:
> cluster.granular-entry-heal: enable
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> user.cifs: off
> network.ping-timeout: 30
> network.remote-dio: off
> performance.strict-o-direct: on
> performance.low-prio-threads: 32
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> nfs.disable: on
> performance.client-io-threads: off
>
> Cheers,
>
> Jiri
> _______________________________________________
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/BPXG53NG34QKCABYJ35UYIWPNNWTKXW4/
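One more thought on the quoted logs: MSGID 113075 comes from the brick's posix health-check thread, which periodically does a small I/O on the brick filesystem and SIGTERMs the brick process when that I/O fails or stalls. If the devices really are healthy, the health-check tunables might be worth a look. A sketch, assuming the option exists on your glusterfs build (verify with `gluster volume set help`), and with disabling intended as a debugging aid only:

```shell
# Show the current posix health-check interval (seconds; 30 by default):
gluster volume get vms storage.health-check-interval
# Temporarily disable the check (0 = off) while investigating -- this
# masks the underlying I/O stall rather than fixing it, so re-enable
# it once you've found the cause:
gluster volume set vms storage.health-check-interval 0
```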
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/BZRONK53OGWSOPUSGQ76GIXUM7J6HHMJ/