Re: [Gluster-users] random disconnects of peers
I just reniced all glusterfsd processes on all nodes to a value of -10. The problem just occurred again, so it seems renicing the processes didn't help.

On 18.08.2022 09:54, Péter Károly JUHÁSZ wrote:

What if you renice the gluster processes to some negative value?

On Thu, 18 Aug 2022 at 09:45, the original poster wrote:

Hi folks,

I am running multiple GlusterFS servers in multiple datacenters. Every datacenter is basically the same setup: 3x storage nodes, 3x KVM hypervisors (oVirt) and 2x HPE switches acting as one logical unit. The NICs of all servers are attached to both switches with a bonding of two NICs, in case one of the switches has a major problem. In one datacenter I have had strange problems with the GlusterFS for nearly half a year now and I'm not able to figure out the root cause.

Environment:
- glusterfs 9.5 running on CentOS 7.9.2009 (Core)
- three gluster volumes, all options configured identically

root@storage-001# gluster volume info

Volume Name: g-volume-domain
Type: Replicate
Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
Options Reconfigured:
client.event-threads: 4
performance.cache-size: 1GB
server.event-threads: 4
server.allow-insecure: On
network.ping-timeout: 42
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.quick-read: off
cluster.data-self-heal-algorithm: diff
storage.owner-uid: 36
storage.owner-gid: 36
performance.readdir-ahead: on
performance.read-ahead: off
client.ssl: off
server.ssl: off
auth.ssl-allow: storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
ssl.cipher-list: HIGH:!SSLv2
cluster.shd-max-threads: 4
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
performance.io-thread-count: 32

Problem:

The glusterd on one storage node seems to lose the connection to another storage node. When the problem occurs, the first message in /var/log/glusterfs/glusterd.log is always the following (variable values are filled with "x"):

[2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <xxxxx> (<xxxxx>), in state <xxxxx>, has disconnected from glusterd.

I will post a filtered log for this specific error from each of my storage nodes below.

storage-001:

root@storage-001# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
[2022-08-16 05:34:47.721060 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2022-08-16 06:01:22.472973 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
root@storage-001#

storage-002:

root@storage-002# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:34.502322 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2022-08-16 05:19:16.898406 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2022-08-16 06:01:22.462676 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2022-08-16 10:17:52.154501 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
root@storage-002#

storage-003:

root@storage-003# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:24:18.225432 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
[2022-08-16 05:27:22.683234 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
[2022-08-16 10:17:50.624775 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
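For reference, the renicing mentioned above was done roughly like this on each node (a minimal sketch; it assumes pgrep and renice are available and that the brick processes are named glusterfsd):

    # Renice all running brick processes (glusterfsd) to -10; run as
    # root on every storage node. Note this does not survive a brick
    # restart, so it has to be reapplied after restarts.
    for pid in $(pgrep -x glusterfsd); do
        renice -n -10 -p "$pid"
    done

    # Verify the resulting nice values:
    ps -o pid,ni,comm -C glusterfsd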
Re: [Gluster-users] random disconnects of peers
Yes, the firmware update of the network adapters is planned for next week. The tcpdump is currently running and I will share the result with you.

The update to oVirt 4.4 (and to 4.5) is quite a big deal because of the switch to CentOS Stream, where a full reinstall is required and there is no way to preserve local storage on standalone hypervisors. :P

Gluster opversion is 6.

On 18.08.2022 23:46, Strahil Nikolov wrote:

Usually I start with firmware updates/OS updates. You would be surprised how many times bad firmware (or a dying NIC) has left me puzzled. I also support the tcpdump idea - let it run on all nodes and it might give a clue what is causing this.

I think there is no need to remind you that you should update to oVirt 4.4 and then to 4.5 ;)

By the way, what is your cluster op-version?

Best Regards,
Strahil Nikolov
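For completeness, the capture currently running on each node looks roughly like this (a sketch; the interface name bond0 and the output path are assumptions, while 24007 is glusterd's management port; the op-version queries are how the value quoted above can be checked):

    # Rotating capture of glusterd management traffic (TCP 24007) so
    # the file doesn't grow unbounded while waiting for the next
    # disconnect. bond0 and the output path are assumptions.
    tcpdump -i bond0 -s 0 -C 100 -W 10 \
        -w /var/tmp/glusterd-$(hostname -s).pcap 'tcp port 24007'

    # Current and maximum supported cluster op-version:
    gluster volume get all cluster.op-version
    gluster volume get all cluster.max-op-version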
[Gluster-users] gluster volume not healing - remote operation failed
Hi folks,

my gluster volume isn't fully healing. We had an outage a couple of days ago and all other files were healed successfully. Now - days later - I can see there are still two GFIDs per node remaining in the heal list.

root@storage-001~# for i in `gluster volume list`; do gluster volume heal $i info; done
Brick storage-003.mydomain.com:/mnt/bricks/g-volume-myvolume
Status: Connected
Number of entries: 2

Brick storage-002.mydomain.com:/mnt/bricks/g-volume-myvolume
Status: Connected
Number of entries: 2

Brick storage-001.mydomain.com:/mnt/bricks/g-volume-myvolume
Status: Connected
Number of entries: 2

In the log I can see that the glustershd process is invoked to heal the remaining files, but it fails with "remote operation failed":

[2022-09-14 10:56:50.007978 +0000] I [MSGID: 108026] [afr-self-heal-entry.c:1053:afr_selfheal_entry_do] 0-g-volume-myvolume-replicate-0: performing entry selfheal on 48791313-e5e7-44df-bf99-3ebc8d4cf5d5
[2022-09-14 10:56:50.008428 +0000] I [MSGID: 108026] [afr-self-heal-entry.c:1053:afr_selfheal_entry_do] 0-g-volume-myvolume-replicate-0: performing entry selfheal on a4babc5a-bd5a-4429-b65e-758651d5727c
[2022-09-14 10:56:50.015005 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-g-volume-myvolume-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:50.015007 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-g-volume-myvolume-client-3: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:50.015138 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-g-volume-myvolume-client-4: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:50.614082 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-g-volume-myvolume-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:50.614108 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-g-volume-myvolume-client-3: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:50.614099 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-g-volume-myvolume-client-4: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:51.619623 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-g-volume-myvolume-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:51.619630 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-g-volume-myvolume-client-3: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:51.619632 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-g-volume-myvolume-client-4: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]

The gluster is running with opversion 9 on CentOS. There are no entries in split-brain. How can I get these files finally healed?

Thanks in advance.
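In case it's useful for anyone diagnosing along: one way to map such GFIDs back to a path is via the .glusterfs hardlink tree on a brick (a sketch; the brick path is taken from the heal info output above and the GFID from the shd log):

    # On one of the storage nodes. Every entry on a brick is linked
    # under .glusterfs/<first 2 hex chars>/<next 2 hex chars>/<gfid>.
    BRICK=/mnt/bricks/g-volume-myvolume
    GFID=48791313-e5e7-44df-bf99-3ebc8d4cf5d5
    LINK=$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID

    # Regular files: find the real path via the shared inode.
    find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -samefile "$LINK" -print

    # Directories (the mkdir failures above hint at that): the
    # .glusterfs entry is a symlink to <parent-gfid>/<dirname>.
    readlink "$LINK"

Once the affected paths are known, comparing them across the three bricks usually shows which copy is the odd one out.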
Re: [Gluster-users] random disconnects of peers
I have made extensive load tests over the last few days and figured out that it is definitely a network-related issue. I changed from jumbo frames (MTU 9000) to the default MTU of 1500. With an MTU of 1500 the problem doesn't occur: I'm able to push the io-wait of our gluster storage servers to the maximum the disks can take, without any error or connection loss between the hypervisors or the storage nodes.

As mentioned in multiple Gluster best practices, jumbo frames are recommended in Gluster setups for better performance, so I would like to keep using jumbo frames in my datacenter. What could be the issue here?

On 19.08.2022 07:47, Strahil Nikolov wrote:

You can check the max op-version, and if only the oVirt nodes are using it, bump it to the maximum.

I upgraded my 4.4 while preserving the Gluster storage - just back up /etc/glusterfs & /var/lib/glusterd. Keep in mind that if you use VDO, you need to back up its config too.

Best Regards,
Strahil Nikolov
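For anyone who wants to reproduce the check: a DF-flagged ping just below the jumbo limit verifies whether 9000-byte frames really survive the whole path (a sketch; 8972 bytes of ICMP payload + 8 bytes ICMP header + 20 bytes IP header = 9000 on the wire; storage-002.my.domain stands in for any peer, and bond0 is an assumed interface name):

    # Run between every pair of storage/hypervisor nodes. -M do
    # forbids fragmentation, so any hop that is not really at MTU
    # 9000 makes this fail instead of replying.
    ping -M do -s 8972 -c 4 storage-002.my.domain

    # Confirm the configured MTU on the bond itself:
    ip link show bond0

If this ping fails while plain MTU-1500 traffic works, the mismatch is typically on a switch port or a bond slave rather than in Gluster itself.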
[Gluster-users] random disconnects of peers
Hi folks,

I am running multiple GlusterFS servers in multiple datacenters. Every datacenter is basically the same setup: 3x storage nodes, 3x KVM hypervisors (oVirt) and 2x HPE switches acting as one logical unit. The NICs of all servers are attached to both switches with a bonding of two NICs, in case one of the switches has a major problem. In one datacenter I have had strange problems with the GlusterFS for nearly half a year now and I'm not able to figure out the root cause.

Environment:
- glusterfs 9.5 running on CentOS 7.9.2009 (Core)
- three gluster volumes, all options configured identically

root@storage-001# gluster volume info

Volume Name: g-volume-domain
Type: Replicate
Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
Options Reconfigured:
client.event-threads: 4
performance.cache-size: 1GB
server.event-threads: 4
server.allow-insecure: On
network.ping-timeout: 42
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.quick-read: off
cluster.data-self-heal-algorithm: diff
storage.owner-uid: 36
storage.owner-gid: 36
performance.readdir-ahead: on
performance.read-ahead: off
client.ssl: off
server.ssl: off
auth.ssl-allow: storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
ssl.cipher-list: HIGH:!SSLv2
cluster.shd-max-threads: 4
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
performance.io-thread-count: 32

Problem:

The glusterd on one storage node seems to lose the connection to another storage node. When the problem occurs, the first message in /var/log/glusterfs/glusterd.log is always the following (variable values are filled with "x"):

[2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <xxxxx> (<xxxxx>), in state <xxxxx>, has disconnected from glusterd.

I will post a filtered log for this specific error from each of my storage nodes below.

storage-001:

root@storage-001# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
[2022-08-16 05:34:47.721060 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2022-08-16 06:01:22.472973 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
root@storage-001#

storage-002:

root@storage-002# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:34.502322 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2022-08-16 05:19:16.898406 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2022-08-16 06:01:22.462676 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
[2022-08-16 10:17:52.154501 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (), in state , has disconnected from glusterd.
root@storage-002#

storage-003:

root@storage-003# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:24:18.225432 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
[2022-08-16 05:27:22.683234 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
[2022-08-16 10:17:50.624775 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
root@storage-003#

After this message it takes a couple of seconds (in the specific example of 2022-08-16, one to four seconds) and the disconnected node is reachable again:

[2022-08-16 05:01:32.110518 +0000] I [MSGID: 106493]
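Since all servers are attached to both switches via a two-NIC bond, one thing worth checking around those timestamps is whether the bond itself flapped (a sketch; bond0 is an assumed interface name):

    # Bonding mode, per-slave link state and link-failure counters:
    cat /proc/net/bonding/bond0

    # Kernel messages about link flaps around the disconnect times:
    dmesg -T | grep -iE 'bond0|link is (up|down)'

Rising link-failure counters or link up/down messages that line up with the glusterd disconnects would point at the NICs or switch ports rather than at Gluster.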