Hi Jayme,

On 7/8/21 12:54 PM, Jayme wrote:
I have observed this behaviour recently and in the past on 4.3 and 4.4, and in my case it's almost always following an oVirt upgrade. After an upgrade (especially upgrades involving GlusterFS) I'd have bricks randomly go down, like you're describing, for about a week or so, and I'd have to start them manually. At some point it just corrects itself and is stable again. I really have no idea why it occurs, or what eventually stops it from happening.

Well, I agree that this issue probably follows an oVirt upgrade. No brick has failed so far, but 4.4.7 is out and now I'm hesitant to upgrade ;-) Of course I will, just probably a bit later.

On the gluster list there is at least one other user who observes this behavior; unfortunately, there is no known fix for it :-(
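
For what it's worth, the check that kills the brick is driven by the storage.health-check-interval volume option (30 seconds by default, as far as I know; 0 disables it), and I believe newer releases also have storage.health-check-timeout. I mention it only as a knob one could tune, not as a verified fix, so take the example below as an assumption on my side:

    # show the current interval for the vms volume
    gluster volume get vms storage.health-check-interval
    # relax the check, or set the interval to 0 to switch it off entirely
    gluster volume set vms storage.health-check-interval 60

Of course disabling the check only hides the symptom; the brick would then keep running even if its underlying filesystem stalls.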

Cheers,

Jiri


On Wed, Jul 7, 2021 at 4:10 PM Jiří Sléžka <jiri.sle...@slu.cz> wrote:

    Hello,

    I have 3 node HCI cluster with oVirt 4.4.6 and CentOS8.

    From time to time (I believe) a random brick on a random host goes down
    because of the health-check. It looks like this:

    [root@ovirt-hci02 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
    /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408184] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
    /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408407] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM
    /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.518971] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
    /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.519200] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM

    On another host:

    [root@ovirt-hci01 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
    /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.983327] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix: health-check failed, going down
    /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.983728] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix: still alive! -> SIGTERM
    /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.769129] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
    /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.769819] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM

    I cannot link these errors to any storage/fs issue (nothing in dmesg or
    /var/log/messages), and the brick devices look healthy (smartd).
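
    Just to be concrete about what I checked (the device name below is only
    an example, adjust it to your brick disks):

    dmesg -T | grep -iE "error|fail|xfs"
    grep -i gluster_bricks /var/log/messages
    smartctl -H /dev/sdb

    Nothing suspicious shows up around the failure timestamps.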

    I can force-start the brick with

    gluster volume start vms|engine force

    and after some healing everything works fine again for a few days.
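
    In full, the sequence looks roughly like this (vms as the example
    volume; the status and heal commands are just the usual checks, nothing
    specific to this issue):

    # the Online column shows which brick process has died
    gluster volume status vms
    # restart the missing brick process; running bricks are left alone
    gluster volume start vms force
    # watch the self-heal queue drain back to zero entries
    gluster volume heal vms info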

    Has anybody else observed this behavior?

    The vms volume has this structure (two bricks per host, each a separate
    JBOD SSD disk); the engine volume has one brick on each host...

    gluster volume info vms

    Volume Name: vms
    Type: Distributed-Replicate
    Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 2 x 3 = 6
    Transport-type: tcp
    Bricks:
    Brick1: 10.0.4.11:/gluster_bricks/vms/vms
    Brick2: 10.0.4.13:/gluster_bricks/vms/vms
    Brick3: 10.0.4.12:/gluster_bricks/vms/vms
    Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
    Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
    Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
    Options Reconfigured:
    cluster.granular-entry-heal: enable
    performance.stat-prefetch: off
    cluster.eager-lock: enable
    performance.io-cache: off
    performance.read-ahead: off
    performance.quick-read: off
    user.cifs: off
    network.ping-timeout: 30
    network.remote-dio: off
    performance.strict-o-direct: on
    performance.low-prio-threads: 32
    features.shard: on
    storage.owner-gid: 36
    storage.owner-uid: 36
    transport.address-family: inet
    storage.fips-mode-rchecksum: on
    nfs.disable: on
    performance.client-io-threads: off

    Cheers,

    Jiri

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/KFSSO7DURA3JGAECJGLIXKWNCFDRBQVV/
