Re: [Gluster-users] random disconnects of peers

2022-09-18 Thread Strahil Nikolov
By the way, try to capture the traffic on the systems and compare whether only 
specific packets are not delivered to the destination.
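For example, something along these lines could be run on both peers at the same 
time and the two capture files compared afterwards (the interface name and the 
peer hostname are placeholders; 24007 is the glusterd management port, add the 
brick ports if needed):

  tcpdump -i bond0 -s 0 -w /tmp/$(hostname)-gluster.pcap host storage-002.my.domain and port 24007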
Overall, JF won't give you a double-digit percentage improvement, so in your case I 
would switch to a 1500 MTU.
Best Regards,
Strahil Nikolov
 
 
I already updated the firmware of the NICs a few weeks ago. The switch 
firmware is up to date. I also already swapped the whole switch for a 
completely different model (maybe 6 months ago), without any effect. 
There are no other systems attached to the switch which are using jumbo 
frames.


On 18.09.2022 at 21:07, Strahil Nikolov wrote:
> We are currently shooting in the dark...
> If possible, update the firmware of the NICs and the FW of the switch.
> 
> Have you checked whether other systems (on the same switch) have issues with
> Jumbo Frames?
> 
> Best Regards,
> Strahil Nikolov
> 
>> Yes, I did test the ping with a jumbo frame MTU and it worked without
>> problems. There is no firewall between the storage nodes and the
>> hypervisors. They are using the same layer 2 subnet, so there is only
>> the switch in between. On the switch, jumbo frames are enabled for the
>> specific VLAN.
>> 
>> I also increased the tx and rx queue lengths, without success in
>> relation to the problem.
>> 
>> On 17.09.2022 at 10:39, Strahil Nikolov wrote:
>>> Usually that kind of problem can originate in many places.
>>> When you set the MTU to 9000, did you test with ping and the "Do not
>>> fragment" flag?
>>> 
>>> If there is a device on the path that is not configured for (or
>>> doesn't support) MTU 9000, it will fragment all packets, and that
>>> could lead to excessive device CPU consumption. I have seen many
>>> firewalls that do not use JF by default.
>>> 
>>> ping  -M do -s 8972
>>> 
>>> Best Regards,
>>> Strahil Nikolov
>>> 
>>> On Friday, 16 September 2022 at 22:24:14 GMT+3, Gionatan Danti
>>> wrote:
>>> 
>>> On 2022-09-16 18:41, dpglus...@posteo.de wrote:
 I have made extensive load tests in the last few days and figured out
 it's definitely a network-related issue. I changed from jumbo frames
 (MTU 9000) to the default MTU of 1500. With an MTU of 1500 the problem
 doesn't occur. I'm able to bump the io-wait of our gluster storage
 servers to the max possible values of the disks without any error or
 connection loss between the hypervisors or the storage nodes.

 As mentioned in multiple gluster best practices, it's recommended to
 use jumbo frames in gluster setups for better performance. So I would
 like to use jumbo frames in my datacenter.

 What could be the issue here?
>>> 
>>> I would try with a jumbo frame setting of 4074 (or 4088) bytes.
>>> 
>>> Regards.
>>> 
>>> --
>>> Danti Gionatan
>>> Supporto Tecnico
>>> Assyoma S.r.l. - www.assyoma.it
>>> email: g.da...@assyoma.it - i...@assyoma.it
>>> GPG public key ID: FF5F32A8
  




Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] random disconnects of peers

2022-09-18 Thread Strahil Nikolov
We are currently shooting in the dark... If possible, update the firmware of the 
NICs and the FW of the switch.
Have you checked whether other systems (on the same switch) have issues with Jumbo 
Frames?
Best Regards,
Strahil Nikolov
 
 
Yes, I did test the ping with a jumbo frame MTU and it worked without 
problems. There is no firewall between the storage nodes and the 
hypervisors. They are using the same layer 2 subnet, so there is only 
the switch in between. On the switch, jumbo frames are enabled for the 
specific VLAN.

I also increased the tx and rx queue lengths, without success in relation 
to the problem.
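(For reference, a minimal sketch of such a change - interface names and sizes are 
only examples, not the exact values used here:

  ip link set dev bond0 txqueuelen 10000
  ethtool -G enp1s0f0 rx 4096 tx 4096

The ethtool ring sizes apply per physical NIC and are limited by what the hardware 
reports via "ethtool -g".)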

On 17.09.2022 at 10:39, Strahil Nikolov wrote:
> Usually that kind of problem can originate in many places.
> When you set the MTU to 9000, did you test with ping and the "Do not
> fragment" flag?
> 
> If there is a device on the path that is not configured for (or doesn't
> support) MTU 9000, it will fragment all packets, and that could lead to
> excessive device CPU consumption. I have seen many firewalls that do not
> use JF by default.
> 
> ping  -M do -s 8972
> 
> Best Regards,
> Strahil Nikolov
> 
> On Friday, 16 September 2022 at 22:24:14 GMT+3, Gionatan Danti wrote:
> 
> On 2022-09-16 18:41, dpglus...@posteo.de wrote:
>> I have made extensive load tests in the last few days and figured out
>> it's definitely a network-related issue. I changed from jumbo frames
>> (MTU 9000) to the default MTU of 1500. With an MTU of 1500 the problem
>> doesn't occur. I'm able to bump the io-wait of our gluster storage
>> servers to the max possible values of the disks without any error or
>> connection loss between the hypervisors or the storage nodes.
>> 
>> As mentioned in multiple gluster best practices, it's recommended to
>> use jumbo frames in gluster setups for better performance. So I would
>> like to use jumbo frames in my datacenter.
>> 
>> What could be the issue here?
> 
> I would try with a jumbo frame setting of 4074 (or 4088) bytes.
> 
> Regards.
> 
> --
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.da...@assyoma.it - i...@assyoma.it
> GPG public key ID: FF5F32A8
  






Re: [Gluster-users] random disconnects of peers

2022-09-17 Thread Strahil Nikolov
Usually that kind of problem can originate in many places.
When you set the MTU to 9000, did you test with ping and the "Do not fragment" 
flag?

If there is a device on the path that is not configured for (or doesn't support) 
MTU 9000, it will fragment all packets, and that could lead to excessive device 
CPU consumption. I have seen many firewalls that do not use JF by default.


ping  -M do -s 8972
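(The 8972-byte payload is the 9000-byte MTU minus 20 bytes of IP header and 8 
bytes of ICMP header. A concrete invocation, with a hypothetical peer, would be:

  ping -M do -s 8972 -c 4 storage-002.my.domain

If any hop cannot carry 9000-byte frames unfragmented, the ping fails with a 
"message too long" / "fragmentation needed" error instead of normal replies.)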

Best Regards,
Strahil Nikolov

 On Friday, 16 September 2022 at 22:24:14 GMT+3, Gionatan Danti
 wrote:
 
 On 2022-09-16 18:41, dpglus...@posteo.de wrote:
> I have made extensive load tests in the last few days and figured out
> it's definitely a network-related issue. I changed from jumbo frames
> (MTU 9000) to the default MTU of 1500. With an MTU of 1500 the problem
> doesn't occur. I'm able to bump the io-wait of our gluster storage
> servers to the max possible values of the disks without any error or
> connection loss between the hypervisors or the storage nodes.
> 
> As mentioned in multiple gluster best practices it's recommended to
> use jumbo frames in gluster setups for better performance. So I would
> like to use jumbo frames in my datacenter.
> 
> What could be the issue here?

I would try with a jumbo frame setting of 4074 (or 4088) bytes.
Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.da...@assyoma.it - i...@assyoma.it
GPG public key ID: FF5F32A8
  





Re: [Gluster-users] random disconnects of peers

2022-09-16 Thread Gionatan Danti

On 2022-09-16 18:41, dpglus...@posteo.de wrote:

I have made extensive load tests in the last few days and figured out
it's definitely a network-related issue. I changed from jumbo frames
(MTU 9000) to the default MTU of 1500. With an MTU of 1500 the problem
doesn't occur. I'm able to bump the io-wait of our gluster storage
servers to the max possible values of the disks without any error or
connection loss between the hypervisors or the storage nodes.

As mentioned in multiple gluster best practices it's recommended to
use jumbo frames in gluster setups for better performance. So I would
like to use jumbo frames in my datacenter.

What could be the issue here?


I would try with a jumbo frame setting of 4074 (or 4088) bytes.
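A minimal test of that, assuming bond0 is the storage interface (payload size = 
MTU minus 28 bytes of IP+ICMP headers):

  ip link set dev bond0 mtu 4074     # on each node; the switch must allow at least this frame size
  ping -M do -s 4046 -c 4 storage-002.my.domain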
Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.da...@assyoma.it - i...@assyoma.it
GPG public key ID: FF5F32A8






Re: [Gluster-users] random disconnects of peers

2022-09-16 Thread dpgluster
I have made extensive load tests in the last few days and figured out 
it's definitely a network-related issue. I changed from jumbo frames 
(MTU 9000) to the default MTU of 1500. With an MTU of 1500 the problem 
doesn't occur. I'm able to bump the io-wait of our gluster storage 
servers to the max possible values of the disks without any error or 
connection loss between the hypervisors or the storage nodes.
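(For anyone reproducing the test: the temporary change is a one-liner per node - 
bond0 is just an example interface name, and on CentOS 7 the persistent setting 
also has to go into the ifcfg files:

  ip link set dev bond0 mtu 1500)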


As mentioned in multiple gluster best practices it's recommended to use 
jumbo frames in gluster setups for better performance. So I would like 
to use jumbo frames in my datacenter.


What could be the issue here?


On 19.08.2022 at 07:47, Strahil Nikolov wrote:

You can check the max op-version, and if only the oVirt nodes are using
it, bump it to the maximum.
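A sketch of the corresponding CLI calls (the target number is whatever 
cluster.max-op-version reports, shown here as a placeholder):

  gluster volume get all cluster.max-op-version
  gluster volume get all cluster.op-version
  gluster volume set all cluster.op-version <max-op-version>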

I upgraded my 4.4 while preserving the Gluster storage - just back up
/etc/glusterfs & /var/lib/glusterd. Keep in mind that if you use
VDO you need to back up its config too.
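A minimal backup along those lines, run on each node (the archive path is just an 
example; on EL systems the VDO config is usually /etc/vdoconf.yml - verify that 
path before relying on it):

  tar czf /root/gluster-config-$(hostname)-$(date +%F).tar.gz /etc/glusterfs /var/lib/glusterd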

Best Regards,
Strahil Nikolov


Yes, the firmware update of the network adapters is planned for the next
week.
The tcpdump is currently running and I will share the result with you.
The update to oVirt 4.4 (and to 4.5) is quite a big deal because of the
switch to CentOS Stream, where a full reinstall is required and there is
no possibility to preserve local storage on standalone hypervisors. :P

Gluster opversion is 6.

On 18.08.2022 at 23:46, Strahil Nikolov wrote:

Usually I start with firmware updates/OS updates.
You can be surprised how many times bad firmware (or a dying NIC) has
left me puzzled.

I also support the tcpdump - let it run on all nodes and it might give
a clue what is causing it.

I think there is no need to remind you that you should update to
oVirt 4.4 and then to 4.5 ;)

By the way, what is your cluster op-version?

Best Regards,
Strahil Nikolov


On Thu, Aug 18, 2022 at 14:27, Péter Károly JUHÁSZ
 wrote:









Re: [Gluster-users] random disconnects of peers

2022-08-18 Thread dpgluster
Yes, the firmware update of the network adapters is planned for the next 
week.

The tcpdump is currently running and I will share the result with you.
The update to oVirt 4.4 (and to 4.5) is quite a big deal because of the 
switch to CentOS Stream, where a full reinstall is required and there is 
no possibility to preserve local storage on standalone hypervisors. :P


Gluster op-version is 6.

On 18.08.2022 at 23:46, Strahil Nikolov wrote:

Usually I start with firmware updates/OS updates.
You can be surprised how many times bad firmware (or a dying NIC) has
left me puzzled.

I also support the tcpdump - let it run on all nodes and it might give
a clue what is causing it.

I think there is no need to remind you that you should update to oVirt
4.4 and then to 4.5 ;)

By the way, what is your cluster op-version?

Best Regards,
Strahil Nikolov


On Thu, Aug 18, 2022 at 14:27, Péter Károly JUHÁSZ
 wrote:









Re: [Gluster-users] random disconnects of peers

2022-08-18 Thread Péter Károly JUHÁSZ
Did you try to tcpdump the connections to see who closes the connection, and how?
A normal FIN-ACK, or a timeout? Maybe some network device in between?
(The latter is less probable, since you said you can trigger the error with high
load.)

 On Thu, 18 Aug 2022 at 12:38,  wrote:

> I just niced all glusterfsd processes on all nodes to a value of -10.
> The problem just occurred, so it seems nicing the processes didn't help.
>
> > On 18.08.2022 at 09:54, Péter Károly JUHÁSZ wrote:
> > What if you renice the gluster processes to some negative value?
> >
> >   On Thu, 18 Aug 2022 at 09:45,  wrote:
> >
> >> Hi folks,
> >>
> >> I am running multiple GlusterFS servers in multiple datacenters. Every
> >> datacenter is basically the same setup: 3x storage nodes, 3x KVM
> >> hypervisors (oVirt) and 2x HPE switches which are acting as one logical
> >> unit. The NICs of all servers are attached to both switches with a
> >> bonding of two NICs, in case one of the switches has a major problem.
> >> In one datacenter I have strange problems with the glusterfs for nearly
> >> half of a year now and I'm not able to figure out the root cause.
> >>
> >> Environment
> >> - glusterfs 9.5 running on CentOS 7.9.2009 (Core)
> >> - three gluster volumes, all options equally configured
> >>
> >> root@storage-001# gluster volume info
> >> Volume Name: g-volume-domain
> >> Type: Replicate
> >> Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
> >> Status: Started
> >> Snapshot Count: 0
> >> Number of Bricks: 1 x 3 = 3
> >> Transport-type: tcp
> >> Bricks:
> >> Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
> >> Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
> >> Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
> >> Options Reconfigured:
> >> client.event-threads: 4
> >> performance.cache-size: 1GB
> >> server.event-threads: 4
> >> server.allow-insecure: On
> >> network.ping-timeout: 42
> >> performance.client-io-threads: off
> >> nfs.disable: on
> >> transport.address-family: inet
> >> cluster.quorum-type: auto
> >> network.remote-dio: enable
> >> cluster.eager-lock: enable
> >> performance.stat-prefetch: off
> >> performance.io-cache: off
> >> performance.quick-read: off
> >> cluster.data-self-heal-algorithm: diff
> >> storage.owner-uid: 36
> >> storage.owner-gid: 36
> >> performance.readdir-ahead: on
> >> performance.read-ahead: off
> >> client.ssl: off
> >> server.ssl: off
> >> auth.ssl-allow:
> >>
> >
> storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
> >> ssl.cipher-list: HIGH:!SSLv2
> >> cluster.shd-max-threads: 4
> >> diagnostics.latency-measurement: on
> >> diagnostics.count-fop-hits: on
> >> performance.io-thread-count: 32
> >>
> >> Problem
> >> The glusterd on one storage node seems to lose connection to
> >> another storage node. If the problem occurs, the first message in
> >> /var/log/glusterfs/glusterd.log is always the following (variable
> >> values are replaced with "x"):
> >> [2022-08-16 05:01:28.615441 +] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer
> >>  (),
> >> in
> >> state , has disconnected from glusterd.
> >>
> >> I will post a filtered log for this specific error on each of my
> >> storage
> >> nodes below.
> >> storage-001:
> >> root@storage-001# tail -n 10 /var/log/glusterfs/glusterd.log |
> >> grep
> >> "has disconnected from" | grep "2022-08-16"
> >> [2022-08-16 05:01:28.615441 +] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer
> >>  (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>),
> >> in
> >> state , has disconnected from glusterd.
> >> [2022-08-16 05:34:47.721060 +] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer
> >>  (),
> >> in
> >> state , has disconnected from glusterd.
> >> [2022-08-16 06:01:22.472973 +] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer
> >>  (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>),
> >> in
> >> state , has disconnected from glusterd.
> >> root@storage-001#
> >>
> >> storage-002:
> >> root@storage-002# tail -n 10 /var/log/glusterfs/glusterd.log |
> >> grep
> >> "has disconnected from" | grep "2022-08-16"
> >> [2022-08-16 05:01:34.502322 +] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer
> >>  (),
> >> in
> >> state , has disconnected from glusterd.
> >> [2022-08-16 05:19:16.898406 +] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer
> >>  (),
> >> in
> >> state , has disconnected from glusterd.
> >> [2022-08-16 06:01:22.462676 +] I [MSGID: 106004]
> >> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
> >> Peer
> >>  (),
> >> in
> >> state , has disconnected from glusterd.
> >> [2022-08-16 10:17:52.154501 +] I [MSGID: 106004]
> >> 

Re: [Gluster-users] random disconnects of peers

2022-08-18 Thread dpgluster
I just niced all glusterfsd processes on all nodes to a value of -10. 
The problem just occurred, so it seems nicing the processes didn't help.
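For reference, a minimal way to apply such a renice on every node (assuming the 
brick processes are named glusterfsd):

  renice -n -10 -p $(pgrep glusterfsd)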


On 18.08.2022 at 09:54, Péter Károly JUHÁSZ wrote:

What if you renice the gluster processes to some negative value?

  On Thu, 18 Aug 2022 at 09:45,  wrote:


Hi folks,

I am running multiple GlusterFS servers in multiple datacenters. Every
datacenter is basically the same setup: 3x storage nodes, 3x KVM
hypervisors (oVirt) and 2x HPE switches which are acting as one logical
unit. The NICs of all servers are attached to both switches with a
bonding of two NICs, in case one of the switches has a major problem.
In one datacenter I have strange problems with the glusterfs for nearly
half of a year now and I'm not able to figure out the root cause.

Environment
- glusterfs 9.5 running on CentOS 7.9.2009 (Core)
- three gluster volumes, all options equally configured

root@storage-001# gluster volume info
Volume Name: g-volume-domain
Type: Replicate
Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
Options Reconfigured:
client.event-threads: 4
performance.cache-size: 1GB
server.event-threads: 4
server.allow-insecure: On
network.ping-timeout: 42
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.quick-read: off
cluster.data-self-heal-algorithm: diff
storage.owner-uid: 36
storage.owner-gid: 36
performance.readdir-ahead: on
performance.read-ahead: off
client.ssl: off
server.ssl: off
auth.ssl-allow:


storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain

ssl.cipher-list: HIGH:!SSLv2
cluster.shd-max-threads: 4
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
performance.io-thread-count: 32

Problem
The glusterd on one storage node seems to lose connection to
another storage node. If the problem occurs, the first message in
/var/log/glusterfs/glusterd.log is always the following (variable values
are replaced with "x"):
[2022-08-16 05:01:28.615441 +] I [MSGID: 106004]
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
Peer
 (),
in
state , has disconnected from glusterd.

I will post a filtered log for this specific error on each of my
storage
nodes below.
storage-001:
root@storage-001# tail -n 10 /var/log/glusterfs/glusterd.log |
grep
"has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:28.615441 +] I [MSGID: 106004]
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
Peer
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>),
in
state , has disconnected from glusterd.
[2022-08-16 05:34:47.721060 +] I [MSGID: 106004]
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
Peer
 (),
in
state , has disconnected from glusterd.
[2022-08-16 06:01:22.472973 +] I [MSGID: 106004]
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
Peer
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>),
in
state , has disconnected from glusterd.
root@storage-001#

storage-002:
root@storage-002# tail -n 10 /var/log/glusterfs/glusterd.log |
grep
"has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:34.502322 +] I [MSGID: 106004]
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
Peer
 (),
in
state , has disconnected from glusterd.
[2022-08-16 05:19:16.898406 +] I [MSGID: 106004]
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
Peer
 (),
in
state , has disconnected from glusterd.
[2022-08-16 06:01:22.462676 +] I [MSGID: 106004]
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
Peer
 (),
in
state , has disconnected from glusterd.
[2022-08-16 10:17:52.154501 +] I [MSGID: 106004]
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
Peer
 (),
in
state , has disconnected from glusterd.
root@storage-002#

storage-003:
root@storage-003# tail -n 10 /var/log/glusterfs/glusterd.log |
grep
"has disconnected from" | grep "2022-08-16"
[2022-08-16 05:24:18.225432 +] I [MSGID: 106004]
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
Peer
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>),
in
state , has disconnected from glusterd.
[2022-08-16 05:27:22.683234 +] I [MSGID: 106004]
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
Peer
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>),
in
state , has disconnected from glusterd.
[2022-08-16 10:17:50.624775 +] I [MSGID: 106004]
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
Peer
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>),
in
state , has disconnected from glusterd.

Re: [Gluster-users] random disconnects of peers

2022-08-18 Thread Péter Károly JUHÁSZ
What if you renice the gluster processes to some negative value?

 On Thu, 18 Aug 2022 at 09:45,  wrote:

> Hi folks,
>
> I am running multiple GlusterFS servers in multiple datacenters. Every
> datacenter is basically the same setup: 3x storage nodes, 3x KVM
> hypervisors (oVirt) and 2x HPE switches which are acting as one logical
> unit. The NICs of all servers are attached to both switches with a
> bonding of two NICs, in case one of the switches has a major problem.
> In one datacenter I have strange problems with the glusterfs for nearly
> half of a year now and I'm not able to figure out the root cause.
>
> Environment
> - glusterfs 9.5 running on CentOS 7.9.2009 (Core)
> - three gluster volumes, all options equally configured
>
> root@storage-001# gluster volume info
> Volume Name: g-volume-domain
> Type: Replicate
> Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
> Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
> Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
> Options Reconfigured:
> client.event-threads: 4
> performance.cache-size: 1GB
> server.event-threads: 4
> server.allow-insecure: On
> network.ping-timeout: 42
> performance.client-io-threads: off
> nfs.disable: on
> transport.address-family: inet
> cluster.quorum-type: auto
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.quick-read: off
> cluster.data-self-heal-algorithm: diff
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.readdir-ahead: on
> performance.read-ahead: off
> client.ssl: off
> server.ssl: off
> auth.ssl-allow:
>
> storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
> ssl.cipher-list: HIGH:!SSLv2
> cluster.shd-max-threads: 4
> diagnostics.latency-measurement: on
> diagnostics.count-fop-hits: on
> performance.io-thread-count: 32
>
> Problem
> The glusterd on one storage node seems to lose connection to
> another storage node. If the problem occurs, the first message in
> /var/log/glusterfs/glusterd.log is always the following (variable values
> are replaced with "x"):
> [2022-08-16 05:01:28.615441 +] I [MSGID: 106004]
> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
>  (), in
> state , has disconnected from glusterd.
>
> I will post a filtered log for this specific error on each of my storage
> nodes below.
> storage-001:
> root@storage-001# tail -n 10 /var/log/glusterfs/glusterd.log | grep
> "has disconnected from" | grep "2022-08-16"
> [2022-08-16 05:01:28.615441 +] I [MSGID: 106004]
> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
>  (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in
> state , has disconnected from glusterd.
> [2022-08-16 05:34:47.721060 +] I [MSGID: 106004]
> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
>  (), in
> state , has disconnected from glusterd.
> [2022-08-16 06:01:22.472973 +] I [MSGID: 106004]
> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
>  (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in
> state , has disconnected from glusterd.
> root@storage-001#
>
> storage-002:
> root@storage-002# tail -n 10 /var/log/glusterfs/glusterd.log | grep
> "has disconnected from" | grep "2022-08-16"
> [2022-08-16 05:01:34.502322 +] I [MSGID: 106004]
> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
>  (), in
> state , has disconnected from glusterd.
> [2022-08-16 05:19:16.898406 +] I [MSGID: 106004]
> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
>  (), in
> state , has disconnected from glusterd.
> [2022-08-16 06:01:22.462676 +] I [MSGID: 106004]
> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
>  (), in
> state , has disconnected from glusterd.
> [2022-08-16 10:17:52.154501 +] I [MSGID: 106004]
> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
>  (), in
> state , has disconnected from glusterd.
> root@storage-002#
>
> storage-003:
> root@storage-003# tail -n 10 /var/log/glusterfs/glusterd.log | grep
> "has disconnected from" | grep "2022-08-16"
> [2022-08-16 05:24:18.225432 +] I [MSGID: 106004]
> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
>  (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in
> state , has disconnected from glusterd.
> [2022-08-16 05:27:22.683234 +] I [MSGID: 106004]
> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
>  (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in
> state , has disconnected from glusterd.
> [2022-08-16 10:17:50.624775 +] I [MSGID: 106004]
> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
>  (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in
> 

[Gluster-users] random disconnects of peers

2022-08-18 Thread dpgluster

Hi folks,

I am running multiple GlusterFS servers in multiple datacenters. Every 
datacenter is basically the same setup: 3x storage nodes, 3x KVM 
hypervisors (oVirt) and 2x HPE switches which are acting as one logical 
unit. The NICs of all servers are attached to both switches with a 
bonding of two NICs, in case one of the switches has a major problem.
In one datacenter I have strange problems with the glusterfs for nearly 
half of a year now and I'm not able to figure out the root cause.
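(Since the bond is in the data path, the bond state on each node is also worth a 
quick look - bond0 is an example name; the file shows the active slave and any 
link failure counts per NIC:

  cat /proc/net/bonding/bond0)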


Environment
- glusterfs 9.5 running on CentOS 7.9.2009 (Core)
- three gluster volumes, all options equally configured

root@storage-001# gluster volume info
Volume Name: g-volume-domain
Type: Replicate
Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
Options Reconfigured:
client.event-threads: 4
performance.cache-size: 1GB
server.event-threads: 4
server.allow-insecure: On
network.ping-timeout: 42
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.quick-read: off
cluster.data-self-heal-algorithm: diff
storage.owner-uid: 36
storage.owner-gid: 36
performance.readdir-ahead: on
performance.read-ahead: off
client.ssl: off
server.ssl: off
auth.ssl-allow: 
storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain

ssl.cipher-list: HIGH:!SSLv2
cluster.shd-max-threads: 4
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
performance.io-thread-count: 32

Problem
The glusterd on one storage node seems to lose connection to 
another storage node. If the problem occurs, the first message in 
/var/log/glusterfs/glusterd.log is always the following (variable values 
are replaced with "x"):
[2022-08-16 05:01:28.615441 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (), in 
state , has disconnected from glusterd.


I will post a filtered log for this specific error on each of my storage 
nodes below.

storage-001:
root@storage-001# tail -n 10 /var/log/glusterfs/glusterd.log | grep 
"has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:28.615441 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in 
state , has disconnected from glusterd.
[2022-08-16 05:34:47.721060 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (), in 
state , has disconnected from glusterd.
[2022-08-16 06:01:22.472973 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in 
state , has disconnected from glusterd.

root@storage-001#

storage-002:
root@storage-002# tail -n 10 /var/log/glusterfs/glusterd.log | grep 
"has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:34.502322 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (), in 
state , has disconnected from glusterd.
[2022-08-16 05:19:16.898406 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (), in 
state , has disconnected from glusterd.
[2022-08-16 06:01:22.462676 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (), in 
state , has disconnected from glusterd.
[2022-08-16 10:17:52.154501 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (), in 
state , has disconnected from glusterd.

root@storage-002#

storage-003:
root@storage-003# tail -n 10 /var/log/glusterfs/glusterd.log | grep 
"has disconnected from" | grep "2022-08-16"
[2022-08-16 05:24:18.225432 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in 
state , has disconnected from glusterd.
[2022-08-16 05:27:22.683234 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in 
state , has disconnected from glusterd.
[2022-08-16 10:17:50.624775 +] I [MSGID: 106004] 
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer 
 (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in 
state , has disconnected from glusterd.

root@storage-003#

After this message it takes a couple of seconds (in the specific example of 
2022-08-16 it's one to four seconds) and the disconnected node is 
reachable again:
[2022-08-16 05:01:32.110518 +] I [MSGID: 106493]