Re: [Gluster-users] random disconnects of peers

2022-08-18 Thread dpgluster
I just niced all glusterfsd processes on all nodes to a value of -10. 
The problem just occurred, so it seems nicing the processes didn't help.
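(For reference, the renice was applied roughly like this on each storage node; a sketch, the pgrep pattern is an assumption:)

# bump every brick process (glusterfsd) to nice -10; run on every node
for pid in $(pgrep -x glusterfsd); do renice -n -10 -p "$pid"; done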


On 18.08.2022 at 09:54, Péter Károly JUHÁSZ wrote:

What if you renice the gluster processes to some negative value?

On Thursday, 18 August 2022 at 09:45, dpgluster wrote:


Hi folks,

I am running multiple GlusterFS servers in multiple datacenters. Every datacenter is basically the same setup: 3x storage nodes, 3x KVM hypervisors (oVirt) and 2x HPE switches which are acting as one logical unit. The NICs of all servers are attached to both switches with a bonding of two NICs, in case one of the switches has a major problem.
In one datacenter I have had strange problems with GlusterFS for nearly half a year now and I'm not able to figure out the root cause.

Environment
- glusterfs 9.5 running on CentOS 7.9.2009 (Core)
- three gluster volumes, all options equally configured

root@storage-001# gluster volume info
Volume Name: g-volume-domain
Type: Replicate
Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
Options Reconfigured:
client.event-threads: 4
performance.cache-size: 1GB
server.event-threads: 4
server.allow-insecure: On
network.ping-timeout: 42
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.quick-read: off
cluster.data-self-heal-algorithm: diff
storage.owner-uid: 36
storage.owner-gid: 36
performance.readdir-ahead: on
performance.read-ahead: off
client.ssl: off
server.ssl: off
auth.ssl-allow: storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
ssl.cipher-list: HIGH:!SSLv2
cluster.shd-max-threads: 4
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
performance.io-thread-count: 32

Problem
The glusterd on one storage node seems to lose the connection to another storage node. When the problem occurs, the first message in /var/log/glusterfs/glusterd.log is always the following (variable values are filled with "x"):
[2022-08-16 05:01:28.615441 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer  (), in state , has disconnected from glusterd.

I will post a filtered log for this specific error on each of my storage nodes below.

storage-001:
root@storage-001# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:28.615441 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer  (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
[2022-08-16 05:34:47.721060 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer  (), in state , has disconnected from glusterd.
[2022-08-16 06:01:22.472973 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer  (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
root@storage-001#

storage-002:
root@storage-002# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:34.502322 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer  (), in state , has disconnected from glusterd.
[2022-08-16 05:19:16.898406 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer  (), in state , has disconnected from glusterd.
[2022-08-16 06:01:22.462676 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer  (), in state , has disconnected from glusterd.
[2022-08-16 10:17:52.154501 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer  (), in state , has disconnected from glusterd.
root@storage-002#

storage-003:
root@storage-003# tail -n 10 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:24:18.225432 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer  (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
[2022-08-16 05:27:22.683234 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer  (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.
[2022-08-16 10:17:50.624775 +] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer  (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state , has disconnected from glusterd.

Re: [Gluster-users] random disconnects of peers

2022-08-18 Thread dpgluster
Yes, the firmware update of the network adapters is planned for next week.

The tcpdump is currently running (roughly as sketched below) and I will share the result with you.
The update to oVirt 4.4 (and to 4.5) is quite a big deal because of the switch to CentOS Stream, where a full reinstall is required and there is no possibility to preserve local storage on standalone hypervisors. :P
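
For reference, the capture is running roughly like this on each node (a sketch; the interface name and output path are assumptions, 24007 is the glusterd management port):

# capture peer-to-peer glusterd traffic so the disconnects can be correlated
tcpdump -i bond0 -s 0 -w /var/tmp/glusterd-peer.pcap port 24007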


Gluster opversion is 6.
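
(For reference, this is roughly how the cluster op-version is checked and raised with the standard CLI; the target number must not exceed the reported maximum:)

# current cluster op-version and the highest one all peers support
gluster volume get all cluster.op-version
gluster volume get all cluster.max-op-version
# raise it once all peers and clients support the new version
gluster volume set all cluster.op-version <max-op-version>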

On 18.08.2022 at 23:46, Strahil Nikolov wrote:

Usually I start with firmware updates/OS updates.
You can be surprised how many times bad firmware (or a dying NIC) has left me puzzled.

I also support the tcpdump - let it run on all nodes and it might give
a clue what is causing it.

I think there is no need to remind you that you should update to oVirt
4.4 and then to 4.5 ;)

By the way, what is your cluster OP version ?

Best Regards,
Strahil Nikolov









[Gluster-users] gluster volume not healing - remote operation failed

2022-09-14 Thread dpgluster

Hi folks,

my gluster volume isn't fully healing. We had an outage a couple of days ago and all other files were healed successfully. Now, days later, I can see there are still two GFIDs per node remaining in the heal list.


root@storage-001~# for i in `gluster volume list`; do gluster volume heal $i info; done

Brick storage-003.mydomain.com:/mnt/bricks/g-volume-myvolume


Status: Connected
Number of entries: 2

Brick storage-002.mydomain.com:/mnt/bricks/g-volume-myvolume


Status: Connected
Number of entries: 2

Brick storage-001.mydomain.com:/mnt/bricks/g-volume-myvolume


Status: Connected
Number of entries: 2
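
For a more compact per-brick view of the same state, the summary form can be used (a sketch; syntax as in recent Gluster releases):

# per-brick counts of entries pending heal and in split-brain
gluster volume heal g-volume-myvolume info summary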

In the log I can see that the glustershd process is invoked to heal the remaining files but fails with "remote operation failed".
[2022-09-14 10:56:50.007978 +] I [MSGID: 108026] 
[afr-self-heal-entry.c:1053:afr_selfheal_entry_do] 
0-g-volume-myvolume-replicate-0: performing entry selfheal on 
48791313-e5e7-44df-bf99-3ebc8d4cf5d5
[2022-09-14 10:56:50.008428 +] I [MSGID: 108026] 
[afr-self-heal-entry.c:1053:afr_selfheal_entry_do] 
0-g-volume-myvolume-replicate-0: performing entry selfheal on 
a4babc5a-bd5a-4429-b65e-758651d5727c
[2022-09-14 10:56:50.015005 +] E [MSGID: 114031] 
[client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 
0-g-volume-myvolume-client-2: remote operation failed. [{path=(null)}, 
{errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:50.015007 +] E [MSGID: 114031] 
[client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 
0-g-volume-myvolume-client-3: remote operation failed. [{path=(null)}, 
{errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:50.015138 +] E [MSGID: 114031] 
[client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 
0-g-volume-myvolume-client-4: remote operation failed. [{path=(null)}, 
{errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:50.614082 +] E [MSGID: 114031] 
[client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 
0-g-volume-myvolume-client-2: remote operation failed. [{path=(null)}, 
{errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:50.614108 +] E [MSGID: 114031] 
[client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 
0-g-volume-myvolume-client-3: remote operation failed. [{path=(null)}, 
{errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:50.614099 +] E [MSGID: 114031] 
[client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 
0-g-volume-myvolume-client-4: remote operation failed. [{path=(null)}, 
{errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:51.619623 +] E [MSGID: 114031] 
[client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 
0-g-volume-myvolume-client-2: remote operation failed. [{path=(null)}, 
{errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:51.619630 +] E [MSGID: 114031] 
[client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 
0-g-volume-myvolume-client-3: remote operation failed. [{path=(null)}, 
{errno=22}, {error=Invalid argument}]
[2022-09-14 10:56:51.619632 +] E [MSGID: 114031] 
[client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 
0-g-volume-myvolume-client-4: remote operation failed. [{path=(null)}, 
{errno=22}, {error=Invalid argument}]
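
In case it helps with debugging, the GFIDs from the log can be mapped back to paths on a brick via the .glusterfs index (a hedged sketch of the usual approach; brick path and GFID are taken from the output above):

# heal entries are addressed by GFID; on the brick they live under
# .glusterfs/<first two hex chars>/<next two>/<full gfid>
BRICK=/mnt/bricks/g-volume-myvolume
GFID=48791313-e5e7-44df-bf99-3ebc8d4cf5d5
ls -ld "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
# a directory shows up as a symlink (readlink reveals parent/name),
# a regular file as a hard link (find -samefile reveals its path)
readlink "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID" || \
find "$BRICK" -samefile "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID" -not -path "*/.glusterfs/*"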


The gluster is running with opversion 9 on CentOS. There are no 
entries in split brain.


How can I get these files finally healed?

Thanks in advance.






Re: [Gluster-users] random disconnects of peers

2022-09-16 Thread dpgluster
I have run extensive load tests over the last few days and figured out that it's definitely a network-related issue. I changed from jumbo frames (MTU 9000) to the default MTU of 1500. With an MTU of 1500 the problem doesn't occur: I'm able to drive the I/O wait of our gluster storage servers to the maximum the disks can sustain without any error or connection loss between the hypervisors or the storage nodes.


As mentioned in multiple Gluster best-practice guides, jumbo frames are recommended in Gluster setups for better performance. So I would like to use jumbo frames in my datacenter.


What could be the issue here?


On 19.08.2022 at 07:47, Strahil Nikolov wrote:

You can check the max op-version and, if only the oVirt nodes are using it -> bump it to the maximum.

I upgraded my 4.4 while preserving the Gluster storage - just back up /etc/glusterfs & /var/lib/glusterd. Keep in mind that if you use VDO you need to back up its config too.
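
(A sketch of that backup step, assuming the default paths mentioned above; the archive location is an assumption:)

# archive the gluster configuration and cluster state before the reinstall
tar czf /root/gluster-config-backup-$(hostname -s).tar.gz /etc/glusterfs /var/lib/glusterd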

Best Regards,
Strahil Nikolov


Yes, the firmware update of the network adapters is planned for next week.
The tcpdump is currently running and I will share the result with you.
The update to oVirt 4.4 (and to 4.5) is quite a big deal because of the switch to CentOS Stream, where a full reinstall is required and there is no possibility to preserve local storage on standalone hypervisors. :P

Gluster opversion is 6.

On 18.08.2022 at 23:46, Strahil Nikolov wrote:

Usually I start with firmware updates/OS updates.
You can be surprised how many times bad firmware (or a dying NIC) has left me puzzled.

I also support the tcpdump - let it run on all nodes and it might give a clue what is causing it.

I think there is no need to remind you that you should update to oVirt 4.4 and then to 4.5 ;)

By the way, what is your cluster OP version ?

Best Regards,
Strahil Nikolov









[Gluster-users] random disconnects of peers

2022-08-18 Thread dpgluster

Hi folks,

I am running multiple GlusterFS servers in multiple datacenters. Every datacenter is basically the same setup: 3x storage nodes, 3x KVM hypervisors (oVirt) and 2x HPE switches which are acting as one logical unit. The NICs of all servers are attached to both switches with a bonding of two NICs, in case one of the switches has a major problem.
In one datacenter I have had strange problems with GlusterFS for nearly half a year now and I'm not able to figure out the root cause.


Environment
- glusterfs 9.5 running on CentOS 7.9.2009 (Core)
- three gluster volumes, all options equally configured

root@storage-001# gluster volume info
Volume Name: g-volume-domain
Type: Replicate
Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
Options Reconfigured:
client.event-threads: 4
performance.cache-size: 1GB
server.event-threads: 4
server.allow-insecure: On
network.ping-timeout: 42
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.quick-read: off
cluster.data-self-heal-algorithm: diff
storage.owner-uid: 36
storage.owner-gid: 36
performance.readdir-ahead: on
performance.read-ahead: off
client.ssl: off
server.ssl: off
auth.ssl-allow: 
storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain

ssl.cipher-list: HIGH:!SSLv2
cluster.shd-max-threads: 4
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
performance.io-thread-count: 32
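
(For context, individual options like these can be queried or changed with the standard CLI; a sketch using one of the options above:)

# show the effective value of a single option, then (re)set it
gluster volume get g-volume-domain network.ping-timeout
gluster volume set g-volume-domain network.ping-timeout 42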

Problem
The glusterd on one storage node seems to lose the connection to another storage node. When the problem occurs, the first message in /var/log/glusterfs/glusterd.log is always the following (variable values are filled with "x"):
[2022-08-16 05:01:28.615441 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (), in 
state , has disconnected from glusterd.


I will post a filtered log for this specific error on each of my storage 
nodes below.

storage-001:
root@storage-001# tail -n 10 /var/log/glusterfs/glusterd.log | grep 
"has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:28.615441 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in 
state , has disconnected from glusterd.
[2022-08-16 05:34:47.721060 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (), in 
state , has disconnected from glusterd.
[2022-08-16 06:01:22.472973 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in 
state , has disconnected from glusterd.

root@storage-001#

storage-002:
root@storage-002# tail -n 10 /var/log/glusterfs/glusterd.log | grep 
"has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:34.502322 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (), in 
state , has disconnected from glusterd.
[2022-08-16 05:19:16.898406 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (), in 
state , has disconnected from glusterd.
[2022-08-16 06:01:22.462676 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (), in 
state , has disconnected from glusterd.
[2022-08-16 10:17:52.154501 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (), in 
state , has disconnected from glusterd.

root@storage-002#

storage-003:
root@storage-003# tail -n 10 /var/log/glusterfs/glusterd.log | grep 
"has disconnected from" | grep "2022-08-16"
[2022-08-16 05:24:18.225432 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in 
state , has disconnected from glusterd.
[2022-08-16 05:27:22.683234 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in 
state , has disconnected from glusterd.
[2022-08-16 10:17:50.624775 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in 
state , has disconnected from glusterd.

root@storage-003#

After this message it takes a couple of seconds (in the specific example of 2022-08-16 it is one to four seconds) and the disconnected node is reachable again:
[2022-08-16 05:01:32.110518 +] I [MSGID: 106493]