[ovirt-users] glusterfs health-check failed, (brick) going down

Jiří Sléžka Wed, 07 Jul 2021 12:09:53 -0700

Hello,

I have a 3-node HCI cluster with oVirt 4.4.6 and CentOS 8.

From time to time a (I believe) random brick on a random host goes down because of a failed health-check. It looks like this:


[root@ovirt-hci02 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*

/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408184] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408407] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.518971] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.519200] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM


On another host:

[root@ovirt-hci01 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*

/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.983327] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix: health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.983728] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix: still alive! -> SIGTERM
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.769129] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.769819] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM
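To compare when each brick went down across hosts, the grep output can be condensed into a list of failure times per brick log. This is a minimal sketch, assuming the line layout shown above. The function name `health_check_failures` and the regular expression are illustrative and not part of any Gluster tooling.

```python
import re
from collections import defaultdict

# Matches grep output of the form
#   <path>.log:[<date> <time>.<usec>] ... health-check failed, going down
# The whitespace between date and time is optional so that lines whose
# spacing was lost in copy/paste are still recognised.
LINE_RE = re.compile(
    r"^(?P<logfile>[^:]+\.log):"
    r"\[(?P<date>\d{4}-\d{2}-\d{2})\s*(?P<time>\d{2}:\d{2}:\d{2})\.\d+\]"
    r".*health-check failed"
)


def health_check_failures(lines):
    """Return {brick log file name: [failure timestamps]} for the given lines."""
    failures = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            # Keep only the file name; it identifies the brick on this host.
            brick_log = match.group("logfile").rsplit("/", 1)[-1]
            failures[brick_log].append(f"{match.group('date')} {match.group('time')}")
    return dict(failures)


# Sample lines taken from the ovirt-hci02 output above.
sample = [
    "/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408184] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down",
    "/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408407] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM",
    "/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.518971] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down",
]

for brick_log, times in health_check_failures(sample).items():
    print(brick_log, times)
```

Only the "health-check failed" lines are counted; the "still alive! -> SIGTERM" lines that follow each failure are skipped so that every event appears once.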

I cannot link these errors to any storage/fs issue (in dmesg or /var/log/messages), and the brick devices look healthy (smartd).


I can force-start the brick (on the vms or the engine volume, respectively) with

gluster volume start vms|engine force

and after some healing everything works fine for a few days.

Has anybody observed this behavior?

The vms volume has this structure (two bricks per host, each on a separate JBOD SSD disk); the engine volume has one brick on each host.


gluster volume info vms

Volume Name: vms
Type: Distributed-Replicate
Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.0.4.11:/gluster_bricks/vms/vms
Brick2: 10.0.4.13:/gluster_bricks/vms/vms
Brick3: 10.0.4.12:/gluster_bricks/vms/vms
Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
Options Reconfigured:
cluster.granular-entry-heal: enable
performance.stat-prefetch: off
cluster.eager-lock: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
user.cifs: off
network.ping-timeout: 30
network.remote-dio: off
performance.strict-o-direct: on
performance.low-prio-threads: 32
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
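
The "2 x 3 = 6" layout can be read as two replica sets of three bricks each. The sketch below groups the brick list accordingly, assuming that consecutive bricks in the listed order form one replica set. The function name `replica_sets` is illustrative, and the parser only handles the plain "N x M = T" form shown here (not arbiter layouts).

```python
import re


def replica_sets(volume_info: str):
    """Group the bricks from `gluster volume info` output into replica sets."""
    layout = re.search(
        r"Number of Bricks:\s*(\d+)\s*x\s*(\d+)\s*=\s*(\d+)", volume_info
    )
    if layout is None:
        raise ValueError("no 'Number of Bricks: N x M = T' line found")
    distribute, replica, total = map(int, layout.groups())

    # "Brick1: host:/path" lines; the bare "Bricks:" header does not match.
    bricks = re.findall(r"^\s*Brick\d+:\s*(\S+)\s*$", volume_info, flags=re.MULTILINE)
    if len(bricks) != total or distribute * replica != total:
        raise ValueError("brick list does not match the declared layout")

    # Assumption: consecutive bricks in listed order belong to the same set.
    return [bricks[i:i + replica] for i in range(0, total, replica)]


info = """\
Number of Bricks: 2 x 3 = 6
Bricks:
Brick1: 10.0.4.11:/gluster_bricks/vms/vms
Brick2: 10.0.4.13:/gluster_bricks/vms/vms
Brick3: 10.0.4.12:/gluster_bricks/vms/vms
Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
"""

for index, group in enumerate(replica_sets(info)):
    print(f"replica set {index}: {group}")
```

Under that assumption, the three vms bricks form one replica set and the three vms2 bricks form the other, so a single failed vms2 brick leaves two copies of that set online.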

Cheers,

Jiri
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/BPXG53NG34QKCABYJ35UYIWPNNWTKXW4/
