Re: [Gluster-users] random disconnects of peers
By the way, try to capture the traffic on the systems and compare whether only specific packets are not delivered to the destination. Overall, jumbo frames won't give you a double-digit improvement, so in your case I would switch to an MTU of 1500.

Best Regards,
Strahil Nikolov

I already updated the firmware of the NICs a few weeks ago. The switch firmware is up to date. I already changed the whole switch to a completely different model without effect (maybe 6 months ago). There are no other systems attached to the switch that use jumbo frames.

On 18.09.2022 21:07, Strahil Nikolov wrote:
> We are currently shooting in the dark...
> If possible, update the firmware of the NICs and the firmware of the switch.
>
> Have you tried whether other systems (on the same switch) have issues with jumbo frames?
>
> Best Regards,
> Strahil Nikolov
>
>> Yes, I did test the ping with a jumbo frame MTU and it worked without problems. There is no firewall between the storage nodes and the hypervisors. They are in the same layer 2 subnet, so there is only the switch in between. On the switch, jumbo frames are enabled for the specific VLAN.
>>
>> I also increased the tx and rx queue lengths, without success in relation to the problem.
>>
>> On 17.09.2022 10:39, Strahil Nikolov wrote:
>>> Usually that kind of problem can be in many places.
>>> When you set the MTU to 9000, did you test with ping and the "Do not fragment" flag?
>>>
>>> If there is a device on the path that is not configured for (or doesn't support) MTU 9000, it will fragment all packets, and that can lead to excessive device CPU consumption. I have seen many firewalls that do not use jumbo frames by default.
>>>
>>> ping -M do -s 8972
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>>
>>> On Friday, 16 September 2022 at 22:24:14 GMT+3, Gionatan Danti wrote:
>>>
>>> On 2022-09-16 18:41, dpglus...@posteo.de wrote:
>>>> I have made extensive load tests in the last few days and figured out it's definitely a network-related issue. I changed from jumbo frames (MTU 9000) to the default MTU of 1500. With an MTU of 1500 the problem doesn't occur. I'm able to push the io-wait of our Gluster storage servers to the maximum the disks can handle without any error or connection loss between the hypervisors or the storage nodes.
>>>>
>>>> As mentioned in multiple Gluster best practices, it's recommended to use jumbo frames in Gluster setups for better performance, so I would like to use jumbo frames in my datacenter. What could be the issue here?
>>>
>>> I would try with a jumbo frame setting of 4074 (or 4088) bytes.
>>>
>>> Regards.
>>>
>>> --
>>> Danti Gionatan
>>> Supporto Tecnico
>>> Assyoma S.r.l. - www.assyoma.it
>>> email: g.da...@assyoma.it - i...@assyoma.it
>>> GPG public key ID: FF5F32A8
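For reference, a minimal capture sketch for that comparison (the interface name and the brick port range are assumptions; 24007 is the glusterd management port, 49152 and up are the usual brick ports):

tcpdump -i bond0 -s 0 -w /var/tmp/gluster-$(hostname -s).pcap 'port 24007 or portrange 49152-49251'

Run it on both peers around a disconnect and compare whether the same packets show up on each side.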
Re: [Gluster-users] random disconnects of peers
We are currently shooting in the dark...
If possible, update the firmware of the NICs and the firmware of the switch.

Have you tried whether other systems (on the same switch) have issues with jumbo frames?

Best Regards,
Strahil Nikolov

Yes, I did test the ping with a jumbo frame MTU and it worked without problems. There is no firewall between the storage nodes and the hypervisors. They are in the same layer 2 subnet, so there is only the switch in between. On the switch, jumbo frames are enabled for the specific VLAN.

I also increased the tx and rx queue lengths, without success in relation to the problem.

On 17.09.2022 10:39, Strahil Nikolov wrote:
> Usually that kind of problem can be in many places.
> When you set the MTU to 9000, did you test with ping and the "Do not fragment" flag?
>
> If there is a device on the path that is not configured for (or doesn't support) MTU 9000, it will fragment all packets, and that can lead to excessive device CPU consumption. I have seen many firewalls that do not use jumbo frames by default.
>
> ping -M do -s 8972
>
> Best Regards,
> Strahil Nikolov
>
> On Friday, 16 September 2022 at 22:24:14 GMT+3, Gionatan Danti wrote:
>
> On 2022-09-16 18:41, dpglus...@posteo.de wrote:
>> I have made extensive load tests in the last few days and figured out it's definitely a network-related issue. I changed from jumbo frames (MTU 9000) to the default MTU of 1500. With an MTU of 1500 the problem doesn't occur. I'm able to push the io-wait of our Gluster storage servers to the maximum the disks can handle without any error or connection loss between the hypervisors or the storage nodes.
>>
>> As mentioned in multiple Gluster best practices, it's recommended to use jumbo frames in Gluster setups for better performance, so I would like to use jumbo frames in my datacenter. What could be the issue here?
>
> I would try with a jumbo frame setting of 4074 (or 4088) bytes.
>
> Regards.
>
> --
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.da...@assyoma.it - i...@assyoma.it
> GPG public key ID: FF5F32A8
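A minimal sketch of the host-side settings referred to above, assuming the bond is called bond0 with member NIC ens1f0 (hypothetical names; adjust to the actual devices):

ip link set dev bond0 mtu 9000            # jumbo frames on the host side
ip link set dev bond0 txqueuelen 10000    # larger transmit queue length
ethtool -g ens1f0                         # show current and maximum ring buffer sizes
ethtool -G ens1f0 rx 4096 tx 4096         # raise RX/TX rings, if the hardware allows it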
Re: [Gluster-users] random disconnects of peers
Usually that kind of problem can be in many places.
When you set the MTU to 9000, did you test with ping and the "Do not fragment" flag?

If there is a device on the path that is not configured for (or doesn't support) MTU 9000, it will fragment all packets, and that can lead to excessive device CPU consumption. I have seen many firewalls that do not use jumbo frames by default.

ping -M do -s 8972

Best Regards,
Strahil Nikolov

On Friday, 16 September 2022 at 22:24:14 GMT+3, Gionatan Danti wrote:

On 2022-09-16 18:41, dpglus...@posteo.de wrote:
> I have made extensive load tests in the last few days and figured out it's definitely a network-related issue. I changed from jumbo frames (MTU 9000) to the default MTU of 1500. With an MTU of 1500 the problem doesn't occur. I'm able to push the io-wait of our Gluster storage servers to the maximum the disks can handle without any error or connection loss between the hypervisors or the storage nodes.
>
> As mentioned in multiple Gluster best practices, it's recommended to use jumbo frames in Gluster setups for better performance, so I would like to use jumbo frames in my datacenter. What could be the issue here?

I would try with a jumbo frame setting of 4074 (or 4088) bytes.

Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.da...@assyoma.it - i...@assyoma.it
GPG public key ID: FF5F32A8
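For reference, the 8972-byte payload is simply the 9000-byte MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. A quick check against one of the peers (hostname taken from the volume info above, as an example):

ping -M do -s 8972 -c 4 storage-002.my.domain   # 9000 - 20 (IPv4) - 8 (ICMP) = 8972

If this reports "message too long" or shows packet loss, some hop in between is not passing 9000-byte frames.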
Re: [Gluster-users] random disconnects of peers
On 2022-09-16 18:41, dpglus...@posteo.de wrote:
> I have made extensive load tests in the last few days and figured out it's definitely a network-related issue. I changed from jumbo frames (MTU 9000) to the default MTU of 1500. With an MTU of 1500 the problem doesn't occur. I'm able to push the io-wait of our Gluster storage servers to the maximum the disks can handle without any error or connection loss between the hypervisors or the storage nodes.
>
> As mentioned in multiple Gluster best practices, it's recommended to use jumbo frames in Gluster setups for better performance, so I would like to use jumbo frames in my datacenter. What could be the issue here?

I would try with a jumbo frame setting of 4074 (or 4088) bytes.

Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.da...@assyoma.it - i...@assyoma.it
GPG public key ID: FF5F32A8
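A minimal sketch of testing that suggestion, assuming the storage interface is bond0 (hypothetical name) and the switch/VLAN is configured for at least the same MTU on all ports:

ip link set dev bond0 mtu 4088
ping -M do -s 4060 -c 4 storage-002.my.domain   # 4088 - 28 bytes of IPv4/ICMP headers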
Re: [Gluster-users] random disconnects of peers
I have made extensive load tests in the last few days and figured out it's definitely a network-related issue. I changed from jumbo frames (MTU 9000) to the default MTU of 1500. With an MTU of 1500 the problem doesn't occur. I'm able to push the io-wait of our Gluster storage servers to the maximum the disks can handle without any error or connection loss between the hypervisors or the storage nodes.

As mentioned in multiple Gluster best practices, it's recommended to use jumbo frames in Gluster setups for better performance, so I would like to use jumbo frames in my datacenter. What could be the issue here?

On 19.08.2022 07:47, Strahil Nikolov wrote:
You can check the max op-version and, if only the oVirt nodes are using it, bump it to the maximum. I upgraded my 4.4 while preserving the Gluster storage - just back up /etc/glusterfs & /var/lib/glusterd. Keep in mind that if you use VDO you need to back up its config too.

Best Regards,
Strahil Nikolov

Yes, the firmware update of the network adapters is planned for next week. The tcpdump is currently running and I will share the result with you. The update to oVirt 4.4 (and to 4.5) is quite a big deal because of the switch to CentOS Stream, where a full reinstall is required and there is no possibility to preserve local storage on standalone hypervisors. :P Gluster op-version is 6.

On 18.08.2022 23:46, Strahil Nikolov wrote:
Usually I start with firmware updates/OS updates. You would be surprised how many times bad firmware (or a dying NIC) has left me puzzled. I also support the tcpdump approach - let it run on all nodes and it might give a clue what is causing it. I think there is no need to remind you that you should update to oVirt 4.4 and then to 4.5 ;) By the way, what is your cluster op-version?

Best Regards,
Strahil Nikolov

On Thu, Aug 18, 2022 at 14:27, Péter Károly JUHÁSZ wrote:
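For reference, a sketch of the op-version check Strahil refers to (the value to set is whatever max-op-version reports):

gluster volume get all cluster.max-op-version   # highest op-version the installed binaries support
gluster volume get all cluster.op-version       # op-version the cluster is currently running
gluster volume set all cluster.op-version <value reported as max-op-version>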
Re: [Gluster-users] random disconnects of peers
Yes, the firmware update of the network adapters is planned for next week. The tcpdump is currently running and I will share the result with you. The update to oVirt 4.4 (and to 4.5) is quite a big deal because of the switch to CentOS Stream, where a full reinstall is required and there is no possibility to preserve local storage on standalone hypervisors. :P Gluster op-version is 6.

On 18.08.2022 23:46, Strahil Nikolov wrote:
Usually I start with firmware updates/OS updates. You would be surprised how many times bad firmware (or a dying NIC) has left me puzzled. I also support the tcpdump approach - let it run on all nodes and it might give a clue what is causing it. I think there is no need to remind you that you should update to oVirt 4.4 and then to 4.5 ;) By the way, what is your cluster op-version?

Best Regards,
Strahil Nikolov

On Thu, Aug 18, 2022 at 14:27, Péter Károly JUHÁSZ wrote:
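A minimal sketch of the config backup mentioned above, run on each node before the upgrade (the archive path is just an example):

tar czf /root/gluster-config-$(hostname -s)-$(date +%F).tar.gz /etc/glusterfs /var/lib/glusterd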
Re: [Gluster-users] random disconnects of peers
Did you try to tcpdump the connections to see who closes the connection, and how? A normal FIN-ACK, or a timeout? Maybe some network device in between? (The latter is less likely, since you said you can trigger the error with high load.)

On Thursday, 18 August 2022 at 12:38, wrote:
> I just niced all glusterfsd processes on all nodes to a value of -10.
> The problem just occurred, so it seems nicing the processes didn't help.
>
> On 18.08.2022 09:54, Péter Károly JUHÁSZ wrote:
> > What if you renice the gluster processes to some negative value?
> >
> > On Thursday, 18 August 2022 at 09:45, wrote:
> >
> >> Hi folks,
> >>
> >> I am running multiple GlusterFS servers in multiple datacenters. Every datacenter is basically the same setup: 3x storage nodes, 3x KVM hypervisors (oVirt) and 2x HPE switches which act as one logical unit. The NICs of all servers are attached to both switches with a bonding of two NICs, in case one of the switches has a major problem. In one datacenter I have had strange problems with GlusterFS for nearly half a year now and I'm not able to figure out the root cause.
> >>
> >> Environment
> >> - glusterfs 9.5 running on CentOS 7.9.2009 (Core)
> >> - three gluster volumes, all options equally configured
> >>
> >> root@storage-001# gluster volume info
> >> Volume Name: g-volume-domain
> >> Type: Replicate
> >> Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
> >> Status: Started
> >> Snapshot Count: 0
> >> Number of Bricks: 1 x 3 = 3
> >> Transport-type: tcp
> >> Bricks:
> >> Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
> >> Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
> >> Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
> >> Options Reconfigured:
> >> client.event-threads: 4
> >> performance.cache-size: 1GB
> >> server.event-threads: 4
> >> server.allow-insecure: On
> >> network.ping-timeout: 42
> >> performance.client-io-threads: off
> >> nfs.disable: on
> >> transport.address-family: inet
> >> cluster.quorum-type: auto
> >> network.remote-dio: enable
> >> cluster.eager-lock: enable
> >> performance.stat-prefetch: off
> >> performance.io-cache: off
> >> performance.quick-read: off
> >> cluster.data-self-heal-algorithm: diff
> >> storage.owner-uid: 36
> >> storage.owner-gid: 36
> >> performance.readdir-ahead: on
> >> performance.read-ahead: off
> >> client.ssl: off
> >> server.ssl: off
> >> auth.ssl-allow: storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
> >> ssl.cipher-list: HIGH:!SSLv2
> >> cluster.shd-max-threads: 4
> >> diagnostics.latency-measurement: on
> >> diagnostics.count-fop-hits: on
> >> performance.io-thread-count: 32
> >>
> >> Problem
> >> The glusterd on one storage node seems to lose the connection to another storage node. When the problem occurs, the first message in /var/log/glusterfs/glusterd.log is always the following (variable values are replaced with "x"):
> >> [2022-08-16 05:01:28.615441 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
> >>
> >> I will post a filtered log for this specific error on each of my storage nodes below.
> >> storage-001:
> >> root@storage-001# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
> >> [2022-08-16 05:01:28.615441 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
> >> [2022-08-16 05:34:47.721060 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
> >> [2022-08-16 06:01:22.472973 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
> >> root@storage-001#
> >>
> >> storage-002:
> >> root@storage-002# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
> >> [2022-08-16 05:01:34.502322 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
> >> [2022-08-16 05:19:16.898406 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
> >> [2022-08-16 06:01:22.462676 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
> >> [2022-08-16 10:17:52.154501 +] I [MSGID: 106004]
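To answer the FIN-ACK versus timeout question from a capture, one approach (a sketch, assuming a pcap file such as gluster.pcap recorded on one of the peers) is to filter for connection-teardown flags on the management port around a disconnect timestamp:

tcpdump -nn -r gluster.pcap 'tcp port 24007 and tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'

A clean FIN/ACK exchange means one side closed the connection deliberately; bare RSTs, or nothing but retransmissions, point at a timeout or at packets being dropped in between.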
Re: [Gluster-users] random disconnects of peers
I just niced all glusterfsd processes on all nodes to a value of -10. The problem just occurred, so it seems nicing the processes didn't help.

On 18.08.2022 09:54, Péter Károly JUHÁSZ wrote:
What if you renice the gluster processes to some negative value?

On Thursday, 18 August 2022 at 09:45, wrote:

Hi folks,

I am running multiple GlusterFS servers in multiple datacenters. Every datacenter is basically the same setup: 3x storage nodes, 3x KVM hypervisors (oVirt) and 2x HPE switches which act as one logical unit. The NICs of all servers are attached to both switches with a bonding of two NICs, in case one of the switches has a major problem. In one datacenter I have had strange problems with GlusterFS for nearly half a year now and I'm not able to figure out the root cause.

Environment
- glusterfs 9.5 running on CentOS 7.9.2009 (Core)
- three gluster volumes, all options equally configured

root@storage-001# gluster volume info
Volume Name: g-volume-domain
Type: Replicate
Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
Options Reconfigured:
client.event-threads: 4
performance.cache-size: 1GB
server.event-threads: 4
server.allow-insecure: On
network.ping-timeout: 42
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.quick-read: off
cluster.data-self-heal-algorithm: diff
storage.owner-uid: 36
storage.owner-gid: 36
performance.readdir-ahead: on
performance.read-ahead: off
client.ssl: off
server.ssl: off
auth.ssl-allow: storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
ssl.cipher-list: HIGH:!SSLv2
cluster.shd-max-threads: 4
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
performance.io-thread-count: 32

Problem
The glusterd on one storage node seems to lose the connection to another storage node. When the problem occurs, the first message in /var/log/glusterfs/glusterd.log is always the following (variable values are replaced with "x"):
[2022-08-16 05:01:28.615441 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.

I will post a filtered log for this specific error on each of my storage nodes below.

storage-001:
root@storage-001# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:28.615441 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
[2022-08-16 05:34:47.721060 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2022-08-16 06:01:22.472973 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
root@storage-001#

storage-002:
root@storage-002# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:34.502322 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2022-08-16 05:19:16.898406 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2022-08-16 06:01:22.462676 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2022-08-16 10:17:52.154501 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
root@storage-002#

storage-003:
root@storage-003# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:24:18.225432 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
[2022-08-16 05:27:22.683234 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
[2022-08-16 10:17:50.624775 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
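For reference, a minimal sketch of the renice step described above (it raises the CPU scheduling priority of the brick processes on each node; as noted, it did not help here):

pgrep -x glusterfsd | xargs -r renice -n -10 -p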
Re: [Gluster-users] random disconnects of peers
What if you renice the gluster processes to some negative value?

On Thursday, 18 August 2022 at 09:45, wrote:

> Hi folks,
>
> I am running multiple GlusterFS servers in multiple datacenters. Every datacenter is basically the same setup: 3x storage nodes, 3x KVM hypervisors (oVirt) and 2x HPE switches which act as one logical unit. The NICs of all servers are attached to both switches with a bonding of two NICs, in case one of the switches has a major problem. In one datacenter I have had strange problems with GlusterFS for nearly half a year now and I'm not able to figure out the root cause.
>
> Environment
> - glusterfs 9.5 running on CentOS 7.9.2009 (Core)
> - three gluster volumes, all options equally configured
>
> root@storage-001# gluster volume info
> Volume Name: g-volume-domain
> Type: Replicate
> Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
> Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
> Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
> Options Reconfigured:
> client.event-threads: 4
> performance.cache-size: 1GB
> server.event-threads: 4
> server.allow-insecure: On
> network.ping-timeout: 42
> performance.client-io-threads: off
> nfs.disable: on
> transport.address-family: inet
> cluster.quorum-type: auto
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.quick-read: off
> cluster.data-self-heal-algorithm: diff
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.readdir-ahead: on
> performance.read-ahead: off
> client.ssl: off
> server.ssl: off
> auth.ssl-allow: storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
> ssl.cipher-list: HIGH:!SSLv2
> cluster.shd-max-threads: 4
> diagnostics.latency-measurement: on
> diagnostics.count-fop-hits: on
> performance.io-thread-count: 32
>
> Problem
> The glusterd on one storage node seems to lose the connection to another storage node. When the problem occurs, the first message in /var/log/glusterfs/glusterd.log is always the following (variable values are replaced with "x"):
> [2022-08-16 05:01:28.615441 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
>
> I will post a filtered log for this specific error on each of my storage nodes below.
>
> storage-001:
> root@storage-001# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
> [2022-08-16 05:01:28.615441 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
> [2022-08-16 05:34:47.721060 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
> [2022-08-16 06:01:22.472973 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
> root@storage-001#
>
> storage-002:
> root@storage-002# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
> [2022-08-16 05:01:34.502322 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
> [2022-08-16 05:19:16.898406 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
> [2022-08-16 06:01:22.462676 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
> [2022-08-16 10:17:52.154501 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
> root@storage-002#
>
> storage-003:
> root@storage-003# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
> [2022-08-16 05:24:18.225432 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
> [2022-08-16 05:27:22.683234 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
> [2022-08-16 10:17:50.624775 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in