[ovirt-users] Re: glusterfs health-check failed, (brick) going down

2021-07-11 Thread Jiří Sléžka

Hi Jayme,

On 7/8/21 12:54 PM, Jayme wrote:
I have observed this behaviour recently and in the past on 4.3 and 4.4, 
and in my case it’s almost always following an ovirt upgrade. After 
upgrade (especially upgrades involving glusterfs) I’d have bricks 
randomly go down like your describing for about a week or so after 
upgrade and I’d have to manually start them. At some point it just 
corrects itself and is stable again. I really have no idea why it occurs 
and what’s happening that eventually stops it from happening.


well, I agree that this issue probably follows oVirt upgrade. Until now 
no brick has failed but 4.4.7 is out and now I'm hesitant if I want to 
upgrade ;-) Of course I will, but probably a bit later.


In the gluster list, there is at least one other user who observe this 
behavior, unfortunately there is no known fix for this :-(


Cheers,

Jiri



On Wed, Jul 7, 2021 at 4:10 PM Jiří Sléžka > wrote:


Hello,

I have 3 node HCI cluster with oVirt 4.4.6 and CentOS8.

For time to time (I belive) random brick on random host goes down
because health-check. It looks like

[root@ovirt-hci02 ~]# grep "posix_health_check"
/var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
07:13:37.408184] M [MSGID: 113075]
[posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
07:13:37.408407] M [MSGID: 113075]
[posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
still
alive! -> SIGTERM
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
16:11:14.518971] M [MSGID: 113075]
[posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
16:11:14.519200] M [MSGID: 113075]
[posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
still
alive! -> SIGTERM

on other host

[root@ovirt-hci01 ~]# grep "posix_health_check"
/var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
13:15:51.983327] M [MSGID: 113075]
[posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix:
health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
13:15:51.983728] M [MSGID: 113075]
[posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix:
still alive! -> SIGTERM
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
01:53:35.769129] M [MSGID: 113075]
[posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
01:53:35.769819] M [MSGID: 113075]
[posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
still
alive! -> SIGTERM

I cannot link these errors to any storage/fs issue (in dmesg or
/var/log/messages), brick devices looks healthy (smartd).

I can force start brick with

gluster volume start vms|engine force

and after some healing all works fine for few days

Did anybody observe this behavior?

vms volume has this structure (two bricks per host, each is separate
JBOD ssd disk), engine volume has one brick on each host...

gluster volume info vms

Volume Name: vms
Type: Distributed-Replicate
Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.0.4.11:/gluster_bricks/vms/vms
Brick2: 10.0.4.13:/gluster_bricks/vms/vms
Brick3: 10.0.4.12:/gluster_bricks/vms/vms
Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
Options Reconfigured:
cluster.granular-entry-heal: enable
performance.stat-prefetch: off
cluster.eager-lock: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
user.cifs: off
network.ping-timeout: 30
network.remote-dio: off
performance.strict-o-direct: on
performance.low-prio-threads: 32
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off

Cheers,

Jiri
___
Users mailing list -- users@ovirt.org 
To unsubscribe send an email to users-le...@ovirt.org

Privacy Statement: https://www.ovirt.org/privacy-policy.html

oVirt Code of Conduct:

[ovirt-users] Re: glusterfs health-check failed, (brick) going down

2021-07-08 Thread Jayme
I have observed this behaviour recently and in the past on 4.3 and 4.4, and
in my case it’s almost always following an ovirt upgrade. After upgrade
(especially upgrades involving glusterfs) I’d have bricks randomly go down
like your describing for about a week or so after upgrade and I’d have to
manually start them. At some point it just corrects itself and is stable
again. I really have no idea why it occurs and what’s happening that
eventually stops it from happening.

On Wed, Jul 7, 2021 at 4:10 PM Jiří Sléžka  wrote:

> Hello,
>
> I have 3 node HCI cluster with oVirt 4.4.6 and CentOS8.
>
> For time to time (I belive) random brick on random host goes down
> because health-check. It looks like
>
> [root@ovirt-hci02 ~]# grep "posix_health_check"
> /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408184] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408407] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still
> alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.518971] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.519200] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still
> alive! -> SIGTERM
>
> on other host
>
> [root@ovirt-hci01 ~]# grep "posix_health_check"
> /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983327] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983728] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix:
> still alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769129] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769819] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still
> alive! -> SIGTERM
>
> I cannot link these errors to any storage/fs issue (in dmesg or
> /var/log/messages), brick devices looks healthy (smartd).
>
> I can force start brick with
>
> gluster volume start vms|engine force
>
> and after some healing all works fine for few days
>
> Did anybody observe this behavior?
>
> vms volume has this structure (two bricks per host, each is separate
> JBOD ssd disk), engine volume has one brick on each host...
>
> gluster volume info vms
>
> Volume Name: vms
> Type: Distributed-Replicate
> Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 3 = 6
> Transport-type: tcp
> Bricks:
> Brick1: 10.0.4.11:/gluster_bricks/vms/vms
> Brick2: 10.0.4.13:/gluster_bricks/vms/vms
> Brick3: 10.0.4.12:/gluster_bricks/vms/vms
> Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
> Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
> Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
> Options Reconfigured:
> cluster.granular-entry-heal: enable
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> user.cifs: off
> network.ping-timeout: 30
> network.remote-dio: off
> performance.strict-o-direct: on
> performance.low-prio-threads: 32
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> nfs.disable: on
> performance.client-io-threads: off
>
> Cheers,
>
> Jiri
> ___
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/BPXG53NG34QKCABYJ35UYIWPNNWTKXW4/
>
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/BZRONK53OGWSOPUSGQ76GIXUM7J6HHMJ/


[ovirt-users] Re: glusterfs health-check failed, (brick) going down

2021-07-08 Thread Sandro Bonazzola
I would recommend asking also on glusterfs users mailing list about this.

Il giorno mer 7 lug 2021 alle ore 21:09 Jiří Sléžka  ha
scritto:

> Hello,
>
> I have 3 node HCI cluster with oVirt 4.4.6 and CentOS8.
>
> For time to time (I belive) random brick on random host goes down
> because health-check. It looks like
>
> [root@ovirt-hci02 ~]# grep "posix_health_check"
> /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408184] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408407] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still
> alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.518971] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.519200] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still
> alive! -> SIGTERM
>
> on other host
>
> [root@ovirt-hci01 ~]# grep "posix_health_check"
> /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983327] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983728] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix:
> still alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769129] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769819] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still
> alive! -> SIGTERM
>
> I cannot link these errors to any storage/fs issue (in dmesg or
> /var/log/messages), brick devices looks healthy (smartd).
>
> I can force start brick with
>
> gluster volume start vms|engine force
>
> and after some healing all works fine for few days
>
> Did anybody observe this behavior?
>
> vms volume has this structure (two bricks per host, each is separate
> JBOD ssd disk), engine volume has one brick on each host...
>
> gluster volume info vms
>
> Volume Name: vms
> Type: Distributed-Replicate
> Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 3 = 6
> Transport-type: tcp
> Bricks:
> Brick1: 10.0.4.11:/gluster_bricks/vms/vms
> Brick2: 10.0.4.13:/gluster_bricks/vms/vms
> Brick3: 10.0.4.12:/gluster_bricks/vms/vms
> Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
> Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
> Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
> Options Reconfigured:
> cluster.granular-entry-heal: enable
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> user.cifs: off
> network.ping-timeout: 30
> network.remote-dio: off
> performance.strict-o-direct: on
> performance.low-prio-threads: 32
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> nfs.disable: on
> performance.client-io-threads: off
>
> Cheers,
>
> Jiri
> ___
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/BPXG53NG34QKCABYJ35UYIWPNNWTKXW4/
>


-- 

Sandro Bonazzola

MANAGER, SOFTWARE ENGINEERING, EMEA R RHV

Red Hat EMEA 

sbona...@redhat.com


*Red Hat respects your work life balance. Therefore there is no need to
answer this email out of your office hours.*
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/BMIRJBLFWAVSTKFERBXKLV6KDD4UIZSD/